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Preface 



The warning could not have been meant for the place 
where it could only be found after approach. 
— Joseph Conrad, Heart of Darkness 



This solution manual was initially intended for instructors using the book, 
so that they could entirely rely on the exercises to provide graded homeworks. 
However, due to repeated criticisms and to requests from readers, as well as 
to the fact that some exercises were stepping stones for following sections and 
to the lack of evidence that the exercises were indeed used to set homeworks, 
we came to realise (albeit late in the day!) that some solutions were needed 
by some (self-study or not) readers. From there, the move to make the whole 
set of solutions available to all readers was a rather natural step. Especially 
when contemplating the incoming revision of Bayesian Core towards a Use R! 
oriented version, with an attempt at reducing the math complexity, another 
reproach found in some of the published criticisms. Therefore, lo and behold!, 
by popular request, the solution manual is now available for free use and 
duplication by anyone, not only by instructors, on the book webpage as well 
as on Springer Verlag's website. 

However, there is a caveat to the opening of the manual to all: since this 
solution manual was first intended (and written) for instructors, some self- 
study readers may come to the realisation that the solutions provided here 
are too sketchy for them because the way we wrote those solutions assumes 
some minimal familiarity with the maths, the probability theory and with the 
statistics behind the arguments. There is unfortunately a limit to the time 
and to the efforts we can put in this solution manual and studying Bayesian 
Core requires some prerequisites in maths (such as matrix algebra and Rie- 
mann integrals) , in probability theory (such as the use of joint and conditional 
densities) and some bases of statistics (such as the notions of inference, suf- 
ficiency and confidence sets) that we cannot cover here. Casella and Berger 
(2001) is a good reference in case a reader is lost with the "basic" concepts 



vi 

or sketchy math derivations. Indeed, we also came to realise that describing 
the book as "self-contained" was a dangerous add as readers were naturally 
inclined to always relate this term to their current state of knowledge, a bias 
resulting in inappropriate expectations. (For instance, some students unfortu- 
nately came to one of our short courses with no previous exposure to standard 
distributions like the t or the gamma distributions.) 

We obviously welcome comments and questions on possibly erroneous so- 
lutions, as well as suggestions for more elegant or more complete solutions: 
since this manual is distributed both freely and independently from the book, 
it can be updated and corrected [almost] in real time! Note however that the R 
codes given in the following pages are not optimised because we prefer to use 
simple and understandable codes, rather than condensed and efficient codes, 
both for time constraints (this manual took about a whole week of August 
2007 to complete) and for pedagogical purposes: the readers must be able to 
grasp the meaning of the R code with a minimum of effort since R program- 
ming is not supposed to be an obligatory entry to the book. In this respect, 
using R replaces the pseudo-code found in other books since it can be im- 
plemented as such but does not restrict understanding. Therefore, if you find 
better [meaning, more efficient/faster] codes than those provided along those 
pages, we would be glad to hear from you, but that does not mean that we will 
automatically substitute your R code for the current one, because readability 
is also an important factor. 

Sceaux &: Montpellier, France, October 26, 2009 
Christian P. Robert &: Jean-Michel Marin 
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Exercise 1.1 Given a function g on R, state the two basic conditions for g to be 
a probability density function (pdf) with respect to the Lebesgue measure. Recall 
the definition of the cumulative distribution function (cdf) associated with g and 
that of the quantile function of g. 



If g is integrable with respect to the Lebesgue measure, g is a pdf if and only 
if' 



1. g is non-negative, g(x) > 

2. g integrates to 1, 

g(x) dx = 1 . 



/ 

Jr 



Exercise 1.2 If (xi,x 2 ) is a normal M2), -27) random vector, with 



£ = 



(7 LOCT ' 
WOT T 2 



recall the conditions on (w,a, r) for 17 to be a (nonsingular) covariance matrix. 
Under those conditions, derive the conditional distribution of x 2 given x\. 



The matrix £ is a covariance matrix if 

1. £ is symmetric and this is the case; 

2. £ is semi-definite positive, i.e. , for every x £ M 2 , x T I7x > 0, or, for every 
(xi,x 2 ), 

a 1 x\ + 2uj(jtxxX2 + t 2 x\ = (ax\ + wti 2 ) 2 + t 2 x\{\ — u 2 ) > . 
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A necessary condition for U to be positive semi-definite is that dct(U) = 
<j 2 t 2 (1 — uo 2 ) > 0, which is equivalent to \u\ < 1. 

In that case, x T I7x > 0. The matrix S is furthermore nonsingular if det(Z') > 
0, which is equivalent to \u\ < 1. 

Under those conditions, the conditional distribution of x 2 given x\ is de- 
fined by 

f{x 2 \x 1 ) ocexpi {xi -/ii x 2 -n 2 )£~ 1 l Xl /il ]/2 
I \x 2 - Hi) 



oc cxp <^ (x\ ~n\ x 2 - ^ 2 ) 



_ —LUCTT 

-2/„ ., \2 



T 2 - 



cx cxp {cr 2 (a;2 - M2) 2 - 2lu<7t(x 1 - [ii)(x 2 - fi 2 )/2det(E)} 
^ 6XP 1 a 2 r 2 (l - u; 2 ) i* 2 ^ " T ^ 1 ^) /2 f 



Therefore, 



; X\ LL\ 2/1 2 \ 

x 2 |xi <~ =yK /x 2 + ujt ,r (1 - a; ) 



Exercise 1.3 Test the helpO command on the functions seq(), sampleO, 
and order (). (Hint: start with helpO.) 



Just type 

> helpO 

> help(seq) 

> help (sample) 

> help (order) 

and try illustrations like 

> x=seq(l,500,10) 

> y=sample(x, 10,rep=T) 

> z=order(y) 



Exercise 1.4 Study the properties of the R function lm() using simulated data 
as in 

> x=rnorm(20) 

> y=3*x+5+rnorm(20,sd=0.3) 

> reslm=lm(y~x) 

> summary (reslm) 
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Generating the normal vectors x and y and calling the linear regression func- 
tion Im leads to 

Call: 

lm(f ormula = y ~ x) 

Residuals : 

Min 1Q Median 3Q Max 

-0.7459 -0.2216 0.1535 0.2130 0.8989 

Coefficients : 

Estimate Std. Error t value Pr(>|t|) 
(Intercept) 5.02530 0.10283 48.87 <2e-16 *** 
x 2.98314 0.09628 30.98 <2e-16 *** 

Sig. code: '***' 0.001 '**' 0.01 0.05 '.' 0.1 ' ' 1 

Residual standard error: 0.4098 on 18 degrees of freedom 
Multiple R-Squared: 0.9816, Adjusted R-squared: 0.9806 

F-statistic: 960 on 1 and 18 DF, p-value : < 2.2e-16 

Therefore, in this experiment, the regression coefficients (a, (3) in E[y|x] = 
a + [3x are estimated by maximum likelihood as a — 2.98 and (5 — 5.03, while 
they are a = 3 and (i — 5 in the simulated dataset. 

Exercise 1.5 Of the R functions you have met so far, check which ones are 
written in R by simply typing their name without parentheses, as in mean or var. 



Since 

> mean 

function (x, . . . ) 
UseMethod ( "mean" ) 
Environment : namespace :base> 

and 

> var 

function (x, y = NULL, na.rm = FALSE, use) 
{ 

if (missing(use) ) 

use <- if (na.rm) 
"complete . obs" 

else "all. obs" 
na. method <- pmatch(use, cC'all.obs", "complete . obs" , 
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"pairwise . complete . obs" ) ) 
if (is.data.frame(x)) 

x <- as .matrix (x) 
else stopif not (is . atomic (x) ) 
if (is.data.frame(y)) 

y <- as.matrix(y) 
else stopif not (is. atomic (y)) 
. Internal (cov(x, y, na. method, FALSE)) 

} 

Environment : namespace : stats> 

we can deduce that the first function is written in C, while the second function 
is written in R. 
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Exercise 2.1 Check your current knowledge of the normal jY^, a 2 ) distribu- 
tion by writing down its density function and computing its first four moments. 



The density of the normal ,J/{[i,g 2 ) distribution is given by 
<p(x\n,a) = — exp {-(x - n) 2 /2<r 2 } 



and, if X <~ ^V{n, <r 2 ), 

f + °° T - II 

E[X]=fi+ — =^cxp{-(a;- fj,) 2 /2a 2 }dx 

J-oo v27TCr 

= [i + -= / y exp -y 2 /2dy 
V 2ir J-oo 



M+^[-exp-y72]^ 



2 / iJ/=+oo 



27T 

= M ! 

then, using one integration by parts, 

/■+00 2 

E[(X - M ) 2 ] = / -J=- exp - y 2 /2a 2 dy 

J-cso V £71(7 



-oo 



+oo 2 

2 , 



DC 



exp— z /2dz 



r+oo i 

[-zexp-z 2 /2]S^+^ 2 / -=cxp-z 2 /2dz 



v2ir J-oo v27r 

.2 
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exploiting the fact that (x — y(i) 3 exp {—(a; — (u) 2 /2cr 2 } is asymmetric wrt the 
vertical axis x = fx, 

r+oo 3 

E[(X M ) 3 ] = / -J=- exp -y 2 /2a 2 dy = 

J-oo V 27T(7 

and, using once again one integration by parts, 



4 f+oo 

E[(X - ^) 4 ] = / z 4 exp -z 2 /2 ( 

V27T J-oo 

= ^[-z 3 exp-z 2 /2]^-+a 4 / + °° exp -z 2 /2dz 

V 27T J-oo V 27T 



!dz 



= 3(T 4 . 

Thus, the four first (centered) moments of the normal cr 2 ) distribution 

are /i, a 2 , and 3c 4 . 



Exercise 2.2 Before exiting to the next page, think of datasets that could be, or 
could not be, classified as normal. In each case, describe your reason for proposing 
or rejecting a normal modeling. 



A good illustration of the opposition between normal and nonnormal mod- 
clings can be found in insurrance claims: for minor damages, the histogram 
of the data (or of its log-transform) is approximately normal, while, for the 
highest claims, the tail is very heavy and cannot be modeled by a normal 
distribution (but rather by an extreme value distribution). Take for instance 
http : //www . statsci . org/data/general/carinsuk . html 



Exercise 2.3 Reproduce the histogram of Figure 2.1 and the subsequent analy- 
sis conducted in this chapter for the relative changes in reported larcenies relative 
to the 1995 figures, using the 90cntycr.wkl file available on the Webpage of the 
book. 

A new datafile must be created out of the file 90cntycr.wkl. Then, plotting 
an histogram and doing inference on this dataset follows from the directions 
provided within the chapter. 

Exercise 2.4 By creating random subwindows of the region plotted in Figure 
2.2, represent the histograms of these subsamples and examine whether they 
strongly differ or not. Pay attention to the possible influence of the few "bright 
spots" on the image. 
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While a "random subwindow" is not anything clearly defined, we can cre- 
ate a 800 x 800 matrix by 

> cmb=matrix(scan("CMBdata") ,nrow=800) 
and define random subregions by 

> cmbl=cmb [sample (1 : 100, 1) : sample(101 :300, 1) , 
sample (50 : 150 , 1) : sample (401 : 600 , 1) ] 

> cmb2=cmb [sample (701: 750,1) : sample (751 : 800 , 1) , 
sample (650 : 750 , 1) : sample (751 : 800 , 1 ) ] 

Comparing the histograms can then be done as 

> hist (cmbl ,proba=T,xlim=range (cmb) ) 

> par(new=T) 

> hist (cmb2 ,proba=T,xlim=range (cmb) ) 

or, more elaborately, through nonparametric density estimates 

> cnpl=density (cmbl ,ad=3) # Smooth out the bandwith 

> cnp2=density(cmb2,ad=3) # Smooth out the bandwith 

> plot (cnpl ,xlim=range (cmb) ,type="l" , lwd=2 , col="tomato3" , 
main=" comparison" ) 

> lines(cnp2,lty=5,col="steelblue4" ,lwd=2) 

which leads to Figure |2.1| In that case, both subsamples are roughly normal 
but with different parameters. 



comparison 




-O.I O.O O.I 0.2 0.3 0.4 0.5 0.6 

N = 76436 Bandwidth - 0.01 71 7 



Fig. 2.1. Comparison of two density estimates for two random subregions. 



Exercise 2.5 Show that (2.2) can be derived by first setting 6 as a random 
variable with density function tt and then & conditionally on 6 as distributed 
from l{0\@). 
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If ir (9) is the density of the marginal distribution of 9 and then t{9\&) the 
density of the conditional distribution of Q! given 9, the density of the joint 
distribution of 9 and of ^ is given by 

l{6\@)it{6). 

Therefore, Bayes's theorem simply is the derivation of the density of the con- 
ditional distribution of 9 given S> from this joint density. 



Exercise 2.6 Show that the minimization (in 9(&))) of the expectation 
E[L((9, — that is, of the expectation of the quadratic loss function under 
the distribution with density ir(9\S>) — produces the posterior expectation as the 
solution in 9. 



Since 

E[L(0,0))\&] = E[\\9 ~ 9\\ 2 \3>] 

= E[{9 - 9) T {9 - 9)\@] 

= E[\\9\\ 2 -29 T 9+\\9\\ 2 \3>} 

= E[||6»|| 2 |^] -20 T E[6>|^] + \\9\\ 2 

= E[||#|| 2 |^] - ||E[0|^]|| 2 + \\E[6\3f\ ~9\\ 2 , 

minimising E[L(0,0))|i^] is equivalent to minimising ||E[6*|^] — 9\\ 2 and hence 
the solution is 

9 = E[0\S>] . 

Exercise 2.7 Show that the normal, binomial, geometric, Poisson, and expo- 
nential distributions are all exponential families. 



For each of those families of distributions, it is enough to achieve the 
standard form of exponential families 



fe(y) = h(y) exp{9-R(y)-V(9)} 



(2.1) 



as defined in the book. 

In the normal ^Y{9, 1) case, 



Mv) 



-L exp^ {- y 2 + 2y9-9 2 } 
1 2ir * 



and so it fits the representation (2.1 1 with R(y) — y, h(y) = exp(— y 2 /2)/v / 2~7r 
and &{0) = 9 2 /2. 
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In the binomial 33(ri,p) case, 



f P (y) = [ y ) ex P {y i°g(p) + (« - y) lo g(! - p)} , ye {o, 1, 



and it also fits the representation (2.1l with 9 = log(p/(l — p)), = y, 

%) = (;) and V(0) = -»log(l + eV 

In the geometric Sf(p) case [corresponding to the number of failures before 
a success], 

f P (y) = exp{ylog(l -p) +log(p)} , y = 0,1,..., 



and it also fits the representation (2.1 1 with 9 = log(l— p), R(y) — y, h(y) = 1 
and W(9) = -log(l - e e ). 
In the Poisson 3^(X) case, 

fx(y) = -7 exp{ylog(A) - A} 
2/! 



and it also fits the representation (2.1 1 with 6> = log(A), R{y) = y, h(y) = X/y\ 
and W(9) = exp((9). 

In the exponential S'xp(X) case, 

/ A (2/)=exp{-Ai/ + log(A)} 



and it also fits the representation (2.1 ) with 9 — A, R(y) = —y, h(y) = 1 and 
!P(0) = -log(0). 



Exercise 2.8 Show that, for an exponential family, ^(9) is defined by the con- 
straint that fg is a probability density and that the expectation of this distribution 
can be written as dfy(9)/d9, the vector of the derivatives of ^(9) with respect 
to the components of 9. 



Using the representation (2.1 1, 



/ fe(y)dy = J h{y) exp{0 • R(y) - V(0)} dy = 1 
implies that #(9) is uniquely defined by 

J h{y) exp{9-R(y)} dy = / h(y) dyexp {¥(6)} . 
When considering the expectation of R(Y), 
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Ee[R(Y)] = J R{y)h{y) exp {9 ■ R(y) - V(0)} dy 

= J We {e ■ R{y)} h{y) exp {e ■ R{v) ~ m} dy 



^ {9 ■ R(y) - ${B) + <P(6)} h(y) exp {0 ■ R(y) - W(9)} dy 



d&{6) 
89 



J h(y) exp{9-R(y)-&(9)}dy 
J h( y )-^[exp{9-R(y)-*(9)}] dy 



iw x 1 + We i / m exp{e ' R{y) ~ m} dy } 



d&{6) 1 ( a 

89 
8<P(9) 



89 ' 

Exercise 2.9 Show that the updated hyperparameters in (2.5) are given by 

ti'(y)=li + R(y), A'(y) = A + l. 
Find the corresponding expressions for 7r(#|£, X,yi, ■ . . , y n ). 



If we start with 

f g (y) = h(y)e X p{0-R(y)-¥(6)} , and n(0\£, A) cx exp {9 ■ £ - \>P(9)} , 

Bayes theorem implies that 

7r(0|e,A,y)cx/ e (y)7r(0|e,A) 

oc exp {9 ■ R(y) - &(6)} exp {9 ■ £ - X&(9)} 
= exp {0- + -(A + 1)^(0)} • 



t'(v)=Z + R(v), A'(y) = A + l. 



Therefore, 
Similarly, 

n 

Tr(9\Z,\, yi ,...,y n ) ocHf e ( yi )n(e\^X) 
»=i 

oc exp • R( yi ) - &{9)} | exp {9 • £ - A<F(0)} 

= exp { ■ £ flfo) + £] - (A + n)*(9) \ . 



i=i 
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Therefore, 

n 

£{yi, ■ ■ ■ ,Vn) = £, + ^2R(Vi) , A'(yi,...,j/„) = X + n. 



Exercise 2.10 erive the posterior distribution for an iid sample @ = 
(yi, . . . , y n ) from <j¥(Q, 1) and show that it only depends on the sufficient statistic 

V = EiLi Vi/ n - 



Since (see Exercice 2.7) 

Mv) = 



±= exp 1 - {- y 2 + 2y6-9 2 } 
1 2ir * 



fits the representation (2.1l with R(y) = y, h(y) = exp(— j/ 2 /2)/v / 2~7r and 
^(0) = 9 2 /2, a conjugate prior is 

^(0|£,A)cxexpi {2£9-\e 2 } , 

which is equivalent to a o/K(£/A, 1/A) prior distribution. Following the updat- 
ing formula given in Exercice |2.9[ the posterior distribution is a 



<^(£'{vi, ■ ■ ■ ,Vn)/X(yi, yn), l/\'{yi,. . . , y n )) 

distribution, i.e. 

, A/ ft + ny 1 

m m,- ■ - ,2/n ~ -T-, — . t- — 

\ A + n X + n 

It obviously only depends on the sufficient statistics y 



Exercise 2.11 Give the range of values of the posterior mean (2.6) as the pair 
(A, A _1 £) varies over R+ x R. 



While 

A- 1 

the fact that £ can take any value implies that this posterior mean has an un- 
restricted range, which can be seen as a drawback of this conjugate modeling. 
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Exercise 2.12 A Weibull distribution W(a,/3,j) is defined as the power trans- 
form of a gamma &(a,/3) distribution: if X ~ W(a,P,i), then X"> ~ ^(a,/3). 
Show that, when 7 is known, W(a,/3,'y) is an exponential family but that it is 
not an exponential family when 7 is unknown. 



The Weibull random variable X has the density 

7 a „(/3+l) 7 -l -^o 

r(/3) 

since the Jacobian of the change of variables y = x 7 is r yx y ~ 1 . So, checking 



the representation (2.1 1 leads to 

/(x|a, A 7) = exp {[(/? + 1)7 ~ 1] log(x) - ax 7 } , 

with R(x) = (7log(x),-x 7 ), 9 = (J3, a) and V(0) — log r{(3) — log 70^. 

If 7 is unknown, the term x 7 a in the exponential part makes it impossible 



to recover the representation (2.1 1 



Exercise 2.13 Show that, when the prior on 9 = (fi,a 2 ) is cr 2 /A A1 ) x 

(Ao-, a), the marginal prior on fx is a Student's t distribution 
T(2Ao., £, a/A„A CT ) (see Example 2.3 below for the definition of a Student's 
£ density). Give the corresponding marginal prior on a 2 . For an iid sample 
S = (xi, . . . , x n ) from a 2 ), derive the parameters of the posterior dis- 

tribution of {fi, a 2 ). 



Since the joint prior distribution of (/i, a 2 ) is 

7r(/i, a 2 ) cx (a 2 )-^- 1 - 1 / 2 exp ^ { A> - £) 2 + 2a} 

(given that the Jacobian of the change of variable to = a -2 is w~ 2 ), integrating 
out a 2 leads to 

f°° — 1 

tt(//) (x y (a 2 )- A - 3 / 2 exp — {A> - £) 2 + 2a} da 2 

coo _ 0J 

oc / cj a ^ 1/2 exp— {A M (^-£) 2 + 2a} dw 
oc{A Al ( M -e) 2 + 2a}- A ^ 1/2 



_ 2A JZL +l 

a A CT A,(/W) 



2A r 



a 
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which is the proper density of a Student's t distribution T(2A CT , £, a/\^\u). 

By definition of the joint prior on (/i, cr 2 ), the marginal prior on cr 2 is a 
inverse gamma y^(\ cr ,a) distribution. 

The joint posterior distribution of (^,cr 2 ) is 

tt((m, <t 2 )|^) cx (a 2 )-^^) exp {- (A„(0)(/x - £(^)) 2 + a(#)) /2a 2 } , 
with 

A CT (^) = A CT + 3/2 + n/2, 
A M (0) = A M + n, 
^) = (A„€ + ra)/A M (^), 

a(0) = 2a + - 2 + s 2 (®) . 

This is the product of a marginal inverse gamma 

^(A CT (^)-3/2,a(^)/2) 
distribution on a 2 by a conditional normal 

on /z. (Hence, we do get a conjugate prior.) Integrating out cr 2 leads to 

ir(fi\@) oc / (a 2 )~ X ' m cxp {- (A M (^)(/i - £(^)) 2 + /2a 2 } da 2 

Jo 



/ 

Jo 



OC / U) 



A„(®)-2 



exp {- (A^X/x - C(^)) 2 + a(^)) ^/2} dec 



oc (A M (^)( M -C(^)) 2 + «(^)) (A " W 1} , 
which is the generic form of a Student's t distribution. 



Exercise 2.14 Show that, for location and scale models, Jeffreys' prior is given 
by tt j (6) = 1 and n J (9) = 1/9, respectively. 



In the case of a location model, ,f(y\0) — p(y — 9), the Fisher information 
matrix of a location model is given by 



1(0) = E e 



8 log p(Y - 9) T 8 log p(Y - 6) 
86 09 

' dp{y '^My-9) d y 



dp(y 



89 



8p(z) 
8z 



n T 



dp(z)~ 
8z 



89 



/p(z) dz 
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it is indeed constant in 9. Therefore the determinant of 1(9) is also constant 
and Jeffreys' prior can be chosen as ir J (9) = 1 [or any other constant as long 
as the parameter space is not compact]. 

In the case of a scale model, if y ~ f(y/9)/9, a change of variable from 
y to z = log(y) [if y > 0] implies that 77 = log(0) is a location parameter for 
z. Therefore, the Jacobian transform of ir J (rj) = 1 is tt j (9) = 1/6. When y 
can take both negative and positive values, a transform of y into z = log ( 1 2/ 1 ) 
leads to the same result. 



Exercise 2.15 In the case of an exponential family, derive Jeffreys' prior in terms 
of the Hessian matrix of 'P(O), i.e. the matrix of second derivatives of ^(9). 



Using the representation (2.1 1 



logf e (y) = logh(y) + 0-R(y)-*(9), 

we get 

log My) - 



QQQQT QQQQT 

and therefore the Fisher information matrix is the Hessian matrix of <?"(#), 
H(9). This implies that tt j (6>) = det#(0). 



Exercise 2.16 Show that, when ir(9) is a probability density, (2.8) necessarily 
holds for all datasets S 1 . 



Given that ir(9) is a (true) probability density and that the likelihood 
l(9\Si) is also a (true) probability density in & that can be interpreted as a 
conditional density, the product 

ir{ff)l{9\Si) 

is a true joint probability density for (6,@). The above integral therefore 
defines the marginal density of £P, which is always defined. 



Exercise 2.17 Try to devise a parameterized model and an improper prior such 
that, no matter the sample size, the posterior distribution does not exist. (If you 
cannot find such a case, wait until Chapter 6.) 



It is sufficient to find a function of the parameter 9 that goes to infinity 
faster than the likelihood goes to 0, no matter what the sample size is. For 
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instance, take n(9) oc exp# 2 for a Cauchy ^(0,1) model. Then, for a sample 
of size n, the likelihood goes to as 9~ 2n and it cannot beat the exponential 
increase in the prior. 



Exercise 2.18 Show that, under the loss L ao ^ ai , the Bayes estimator associated 
with a prior tt is given by 

8*( x )=l 1 ' f P *( e >«i/ a o + ai, 

I otherwise. 



The posterior expected loss is 
E[L a0iai (e,d)\x} = \ 



a a P 7 '(9 G O \x) if d = 0, 
ai P*(6 G e x \x) if d=l, 



thus the decision minimising this posterior loss is d = 1 when ai P 71 " (9 e 
Oxlx) < a Q P 7 '(9 G O \x), i.e. 

ai(i - P"(0 e ©ok)) < «o ^ e ©ok) , 

and d = otherwise. 



Exercise 2.19 When G {6> ,#i}, show that the Bayesian procedure only de- 
pends on the ratio Qofg (x)/(1 — Qo)fe 1 {%), where g is the prior weight on O . 



In this special case, tt puts a point mass of go on O and of (1 — g ) on 0i. 
Therefore, 



F"{9 = 9 a \x)- gofe " {x) 



Qofe {x) + (1 - Qo)fe 1 {x) 
1 



i + a-ftO/^aOAi-eo)/*^) 

which only depends on the ratio gofe {x)/{l — Qo)fe 1 { x )- 



Exercise 2.20 Show that the limit of the posterior probability P 71 "^ < 0|ar) 
when £ goes to and r goes to oo is ${—x/o). 
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Since 



P"(n<0\x) =${-li{x)/u) 

(J> ( (T 2 ^ - T 2 X T 2 



= # 



a 2 + t 2 V a 2 T 2 
a 2 i + T 2 x 



when £ goes to and r goes to oo, the ratio 

cr 2 £ + r 2 x 



V <y 2 + t 2 V<j 2 t 2 



goes to 



lim 



t 2 x 



= lim 



r x x 



t^oo ^/ CT 2 _|_ T 2^/ (J 2 T 2 r^oo r 2 (T <T 



Exercise 2.21 We recall that the normalizing constant for a Student's 
T(is, /i, a 2 ) distribution is 

r((^ + i)/2)/i>/2) ^ 

Give the value of the integral in the denominator of _Bf above. 



We have 



(»-xf + ( l ,-y) 2 = 2[ti- 



x + y\ , 0~y) 2 



and thus 



J [(n-x) 2 + (»-y) 2 + S 2 Y n &v 



= 2~ n 

= (2a 2 )-" 
where v = 2n — 1 and 



x + yV (x-y) 2 S" 2 



+ 



4 + 2 



d^i 



i+U- 



-\ 2 c2 

x — y\ b 



ja 2 v 



-(u+D/2 



(2n- 1) . 
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Therefore, 



[(/i - xf + ( M - y) 2 + S 2 
= (2a 2 )-" 



d/i 



<j\/vir 



r((u + i)/2)/r(u/2) 

VK 



2 n a 2n - 1 r((u + l)/2)/r{v/2) 



(2n - 1) 



2n-l 



2™ 



(^) 2 + 



r((«/ + i)/2)/r(i//2) 



Note that this expression is used later in the simplified devivation of B^ x 



without the term (2n — l) z 



r/2 n r((v+l)/2)/r(u/2) because this term 



appears in both the numerator and the denominator. 



Exercise 2.22 Approximate Bq X by a Monte Carlo experiment where £ is simu- 
lated from a Student's t distribution with mean (x+y) /2 and appropriate variance, 
and the integrand is proportional to cxp — £ 2 /2. Compare the precision of the re- 
sulting estimator with the above Monte Carlo approximation based on the normal 
simulation. 



The integral of interest in Bq X is 
J [(2£ + x- yf/2 + S 2 } - n+1/2 e-« 2 / 2 d^/V^ 

-4(n-l)(£-(y-x)/2) 2 



^2^-n+V2 f cxp-£ 2 /2 



= (n 



2n 



2 N-n+l/2 f CXp— £ 2 /2 



(2n-2)S 2 



+ 1 



-n+l/2 



7 



2tt 



(£-(y-*)/2) 2 



I/O"" 



+ 1 



>+l)/2 



d£ 



= C* 



exp -g 2 /2 

\/2tT 



where t(£|/z, cr, f) is the density of the Student's T(v, fj,,a 2 ) distribution with 
parameters v = 2n — 2, fi = (y — x)/2, and a 2 = S 2 /4(n — 1), and C is the 
constant 



C = {S 2 ) 



2 v-n+i/2 /r((i/ + i)/2)/r(i//2) 



(TWVTT 



Therefore, we can simulate a sample £i, . . . , £„ from the T{y, fi, a 2 ) distribu- 
tion and approximate the above integral by the average 

C " exp-e 2 /2 

n ^-i 

i=i 



2tt 
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using an R program like 

n=100 

N=1000000 

nu=2*n-2 

barx= . 088 

bary= . 1078 

mu= . 5* (bary-barx) 

stwo=. 00875 

sigma=sqrt ( . 5*stwo/nu) 

C=log(stwo) * (-n+ . 5)+log(sigma*sqrt (nu*pi) )+ 
lgamma ( . 5*nu) -lgamma ( . 5*nu+ . 5) 



# T simulation 

xis=rt(n=N,df=nu)*sigma + mu 
B01=-log(cumsum(dnorm(xis) ) / (1 : N) ) 

B01=exp( (-n+.5)*log(.5*(barx-bary)~2+stwo)+B01-C ) 

# Normal simulation 
xis=rnorm(N) 

C01=cumsum( (stwo+ . 5* (2*xis+barx-bary) ~2)~ (-n+ . 5) ) / (1 : N) 
C01= ( ( . 5* (barx-bary) ~2+stwo) " (-n+ . 5) ) /C01 



# Comparison of the cumulated averages 

plot (C01 [seq(l , N , 1=1000)] , type="l" , col="tomato2" , lwd=2 , 

ylim=c(20,30) ,xlab=expression(N/100) ,ylab=expression(l/B [10] ) ) 
lines(B01 [seq(l ,N, 1=1000)] , col="steelblue3" , lwd=2 , lty=5) 



As shown on Figure 2.2 the precision of the estimator based on the T(i/, /i, cr 2 ) 
simulation is immensely superior to the precision of the estimator based on 
a normal sample: they both converge to the same value 23.42, but with very 
different variances. 



Exercise 2.23 Discuss what happens to the importance sampling approximation 
when the support of g is larger than the support of 7. 



If the support of 7, (S 7 , is smaller than the support of g, the representation 

h(x)g(x) 



j(x) 



j(x) dx 



is not valid and the importance sampling approximation evaluates instead the 
integral 
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o 200 400 eoo SOO 1 ooo 

N / 1 OO 



Fig. 2.2. Comparison of two approximations of the Bayes factor £?oi based on 10 
simulations. 

/ r^— 70) da;. 

Exercise 2.24 Show that the importance weights of Example 2.2 have infinite 
variance. 



The importance weight is 

n 

exp {{8~rf/2} n^ + ^-^r 1 
i=l 

with 9 ~ ,jV(h,<j 2 ). While its expectation is finite — it would be equal to 1 
were we to use the right normalising constants — , the expectation of its square 
is not: 

~ n 

I exp {(6 - M ) 2 /2} J]! 1 + (*< ~ d )T 2 d9 = oo , 
due to the dominance of the exponential term over the polynomial term. 



Exercise 2.25 Show 


that, when 7 is the normal </K(0, vj(y — 


2)) density, the 


ratio 








flip) e*>- 2 )/ 2 " 
-fix) K [l+x 2 /v](»+ l ) 








does not have a finite 


integral. What does this imply about the 


variance of the 


importance weights? 
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This is more or less a special case of Exercise |2.24| with again the exponen- 
tial term dominating the polynomial term, no matter what the value of v > 2 
is. The importance weights have no variance. When running an experiment 
like the following one 

nu=c (3 , 5 , 10 , 100 , 1000 , 10000) 

N=length(nu) 

T=1000 

nors=rnorm(T) 

par (mf row=c (2 , N/2) , mar=c (4,2,4,1)) 

for (nnu in nu){ 

y=sqrt (nnu/ (nnu-2) ) *nors 
isw=dt (y , df =nnu) /dnorm (y) 

hist (log(isw) ,prob=T, col="wheat4" ,nclass=T/20) 

} 

the output in Figure |2.3| shows that the value of v still matters very much in 
the distribution of the weights. When v is small, the probability to get very 
large weights is much higher than with large f 's, and the dispersion decreases 
with v. 




Fig. 2.3. Distributions of the log-importance weights for a normal importance 
distribution against a Student's t target for several values of v. 
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Exercise 2.26 Given two model densities f\{&\9) and f 2 (%>\6) with the 
same parameter 9 and corresponding priors densities tt\{9) and 7r 2 (0), denote 
ni(9\^) = /i(j^|6>)7ri(6») and 7r 2 (6>|^) = / 2 (^|0)7T 2 (0), and show that the Bayes 
factor corresponding to the comparison of both models satisfies 

f 7ri(0|#)a(0)7r 2 (0|0)d0 

/" 7f 2 (0|^)a(0)7T 1 (0|^)d0 

for every positive function a and deduce that 

ni ^ *i(0 2l |^)a(0 2l ) /n 2 ^ 7T 2 (0i i |^)a(0 li ) 

i=l ' i=l 

is a convergent approximation of the Bayes factor B\ 2 when 9ji <~ TTj(d\&) 
(*= 1,2, j = l,...,nj). 

The missing normalising constants in tti(6\@) and 7r 2 (0|f^) are the marginal 
densities mi(f) and m 2 (^), in the sense that (i = 1,2) 

7Ti(fl|^) = 7r<(e|^)/mj(^) . 

Therefore, 

/" 7ri(0|^)a(0)7r 2 (0|^)d0 
y 7r 2 (0|^)a(0)7n(0|^)d0 

y m 1 (S')TT 1 (9\&)a(9)Tr 2 (9\S')d9 
J m 2 {9)pi 1 (e\^)a{e)n 2 {e\^)de 

- m 2 (0) - 

and a is irrelevant for the computation of the ratio of integrals. 

A Monte Carlo implementation of this remark is to represent each integral 
in the ratio as an expectation under ir 2 (9\&) and iri(9\&) respectively. Simu- 
lations 9jiS from both posteriors then produce convergent estimators of the 
corresponding integrals. This method is called bridge sampling and the choice 
of a is relevant in the variance of the corresponding estimator. 
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Exercise 2.27 Show that, when n goes to infinity and when the prior has an 
unlimited support, the predictive distribution converges to the exact (sampling) 
distribution of x n+ \. 



This property follows from the fact that the posterior distribution con- 
verges to a Dirac mass at the true value 6* of the parameter when n goes to 
infinity [under some regularity conditions on both n and f(x\8), as well as 
idcntifiability constraints]. Therefore, 

J f(x n+1 \e)n(9\@ n )d6 

converges to f(x n+ \\6*). 

Exercise 2.28 Show that, when X is distributed from an increasing and con- 
tinuous cdf F, F(X) has a uniform distribution. 



If F is increasing and continuous, it is invertible and we have 

P(F(X) < u) = P(X < F" 1 ^)) = F(F^ 1 (u)) =u, 

when < u < 1. This demonstrates that F(X) is uniformely distributed and 
this property can be exploited for simulation purposes when F is available in 
closed form. 
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Exercise 3.1 Show that the matrix X is of full rank if and only if the matrix 
X T X is invertible (where X J denotes the transpose of the matrix X, which can 
produced in R using the t(X) command). Deduce that this cannot happen when 

k+ 1 > n. 



The matrix X is a (n,k + 1) matrix. It is of full rank if the k + 1 columns 
of X induce a subspace of R™ of dimension (k + 1), or, in other words, if 
those columns are linearly independent: there exists no solution to X'j = 0„ 
other than 7 = 0„, where 0^, +1 denotes the (k + l)-dimensional vector made 
of O's. If X T X is invertible, then Xj = 0„ implies X T X~/ = X T n = k+1 
and thus 7 = (X T X)~ 1 0k+i = Ok+i, therefore X is of full rank. If X 1 X is 
not invertible, there exist vectors (3 and 7^/3 such that X T Xf3 = X T Xj, 
i.e. X T X(f3 - 7) = Ofc+i. This implies that \\X(0 - ~f)\\ 2 = and hence 
X(P — 7) = 0„ for (3 — 7 7^ Ofc+i, thus X is not of full rank. 

Obviously, the matrix (fc + 1, k + 1) matrix X 1 X cannot be invertible if 
k + 1 > n since the columns of X are then necessarily linearly dependent. 

Exercise 3.2 Show that solving the minimization program above requires solv- 
ing the system of equations (X T X)f3 = X T y. Check that this can be done via 
the R command 

> solve (t(X) •/.*'/. (X) ,t(X)%*y.y) 



If we decompose (y — X(3) T (y — Xf3) as 

y J y-2y J X(3 + l3 J X 1 Xl3 



24 3 Regression and Variable Selection 
and differentiate this expression in j3, we obtain the equation 
-2y T X + 2(3 T X T X = fe+1 , 

i.e. 

{X T X)(3 = X T y 

by transposing the above. 

As can be checked via help (solve), solve ( A, b) is the R function that 
solves the linear equation system Ax = b. Defining X and y from caterpillar, 

we get 

> solve (t (x) °/.*y.x , t (X) y.*y,y) 



[,1] 



rep(l, 33) 


10. 


.998412367 


VI 


-0. 


,004430805 


V2 


-0. 


.053830053 


V3 


0. 


.067939357 


V4 


-1. 


,293636435 


V5 


0. 


.231636755 


V6 


-0. 


,356799738 


V7 


-0. 


.237469094 


V8 


0. 


.181060170 


V9 


-1. 


.285316143 


V10 


-0. 


.433105521 



which [obviously] gives the same result as the call to the linear regression 
function lm(): 

> lm(y-X-l) 



Call: 

lm(f ormula = y ~ X - 1) 
Coefficients : 

XrepCl, 33) XVI XV2 XV3 XV4 XV5 

10.998412 -0.004431 -0.053830 0.067939 -1.29363 0.23163 

XV6 XV7 XV8 XV9 XV 10 

-0.356800 -0.237469 0.181060 -1.285316 -0.43310 

Note the use of the -1 in the formula y~X-l that eliminates the intercept 
already contained in X. 



Exercise 3.3 Show that V(/3|cr 2 , X) = a 2 {X T X)~ 
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Since (3 = (X 1 X)~ 1 X T y is a linear transform of y ~ J/(X(3, a 2 I n ), we 
have 

~ JK ((X T X)- 1 X T Xf3, a 2 (X T X)- 1 X T X(X T X)- 1 ) , 

Exercise 3.4 Taking advantage of the matrix identities 

(M + X T Xy 1 = M -1 — M~ x (M -1 + (X 1 "^)" 1 ) -1 M- 1 

= (X T X)- 1 - (X T X)- 1 (M- 1 + (X T X)- 1 y 1 (X T X)- 1 

and 

X T X(M + X T X)- 1 M = (M-\M + X T X)(X T X)- 1 )' 1 
= (M _1 + (X T X)~ 1 ) 1 , 

establish that (3.3) and (3.4) are the correct posterior distributions. 

Starting from the prior distribution 

(3\a 2 ,X ~ jT k+1 {0, a 2 M~ l ) , a 2 \X ~ J^f(a, 6) , 
the posterior distribution is 

7r(/3, a 2 0, S 2 ,X)K a -k-l-2a-2-n cxp ^_ _ ^T M(/? _ ^ 

+(/3 - /3) T (X T X)(/3 - /3) + s 2 + 26} 

= a -k-n-2a-3 cxp _± ( M + X T X)/3 - 2/? T (M/3 + Jf T X/3) 



+/3 T M/3 + (3 T (X T X)fi + s 2 + 2&} 



= rr-^"- 2 "- 3 cxp — ^ {(/? - £[/%, X]) T (M + X T X)(/3 - E[/%, X]) 
+/3 T M/3 + /3 T (X T X)/3 - E[/%, X] T (M + X T X)E[/3\y, X] + s 2 + 2fe} 

with 

E[0\y, X] = (M + X T X)- 1 {Mj3 + X T X(3) . 

Therefore, (3.3) is the conditional posterior distribution of (3 given a 2 . Inte- 
grating out (3 leads to 
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n(a 2 \l3,s 2 7 X) oca-"- 2a - 2 cxp ^ {/3 T M/3 + p T (X T X)p 
-E[p\y, X] T (M + X T X)E[p\y, X] + s 2 + 2b} 

= a -n-2a-2 cxp _± ^ M fi + pT(X T X)j3 + S 2 + 2b 

-{Mp + X T Xp) T (M + X T X)- 1 (M/3 + X T Xp)} 

Using the first matrix identity, we get that 

(M/3+X T X(3) T (M + X T xy X (Mf3 + X J X(3) 
= (3 T Mf3 - p~ T (M -1 + (X T X)- i y 1 p 
+ (3 T (X T X)p - (3 T (M -1 + (X T X)- 1 y 1 p 
+ 2(J T (X T X) (M + X T Xy 1 MP 
= P T Mf3 + p T (X T X)p 
-0- P) T (M- 1 + (X T X)- 1 )~ 1 (P - P) 

by virtue of the second identity. Therefore, 

7r(a 2 |/3, s 2 , X) oc a- n ~ 2a - 2 cxp ^ [{~P - P) T (M" 1 
+ (X T X)- 1 y 1 (P-p) + s 2 + 2b} 

which is the distribution (3.4). 

Exercise 3.5 Give a (1 - a) HPD region on P based on (3.6). 

As indicated just before this exercise, 

P\y,X~ (n + 2a,£,£) . 

This means that 

{ T - , \ <n+2a+k+l) 

and therefore that an HPD region is of the form 

53 a = {/3;,(/3-A) T ^- 1 (/3-A)<fca} , 
where k a is determined by the coverage probability a. 
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Now, ((3 — jl) T S~ 1 (f3 — p.) has the same distribution as ||z|| 2 when 
z ~ 3?k+i(n + 2a, 0, Ik+i)- This distribution is Fisher's T(k + 1, n + 2a) dis- 
tribution, which means that the bound k a is determined by the quantiles of 
this distribution. 



Exercise 3.6 The regression model can also be used in a predictive sense: 
for a given (m, k + 1) explanatory matrix X, the corresponding outcome y 
can be inferred through the predictive distribution 7r(y|<7 2 , y, X, X). Show that 
7r(y | cr 2 , y, X, X) is a Gaussian density with mean 

E[y|a 2 , y, X, X] = E[E(y|/3, a 2 , y, X, X)\a\ y, X, X] 
= E[X/3\a 2 ,y,X,X} 
= X(M + X T X)- 1 (X T XP + Mj3) 

and covariance matrix 

V(y|<7 2 , y, X)x = E[V(y|/?, a 2 , y, X, X)\a 2 , y, X, X] 

+W(E[y\P,a 2 ,y,X,X]\a 2 ,y,X,X) 
= E[a 2 / m |a 2 , y, X, X] + W(Xf3\a 2 , y, X, X) 
= a 2 {I m + X(M + X T X)- 1 X T ) . 

Deduce that 

y|y,X,X~ (n + 2a,X(M + X T X)- 1 (X T Xf3 + Mfi), 

2b+s 2 + 0- /3) T (Af- 1 + (X T X)- i y 1 - p) 
n + 2a 

x {j m + Jt(M + x T xy 1 x T ^ . 



Since 

and since the posterior distribution of (3 conditional on a is given by (3.3), we 
have that 

y |y, X, X, a ~ ,jV (lE[/V, V, X] , <J 2 I m + Xvar(/3|y, X, a)X T ^j , 

with mean X(M + X T Xy 1 (X T X/3 + M/3), as shown by the derivation of 
Exercise 3.4 The variance is equal to a 2 (l m + X(M + X T Xy 1 X T ^j . 
Integrating a 2 against the posterior distribution (3.4) means that 



2S 



3 Regression and Variable Selection 



7r(y|y,X,X)(x 



_i 

a - m -n-2a-l exp S 2 b + S 2 - 

-i / 7, 



09 - /3) T (M- 1 + (X^)- 1 ) (0 - /?) + (y - E^cr 2 , y, X, X]) T (I„ 
+X(M + X J X)- 1 X T )- 1 - /3)(y - E^a 2 , y, X, 1]) T } da 2 

cx {2fe + ,s 2 + - /3) T (M" 1 + (X T X)- i y 1 0-$) + (y 
-Elylrr 2 , y, X, X]) 1 \l m + X(M + X T X)~ 1 X T )- 1 - /3)(y 

-(m+n+2a)/2 



-E[y\a 2 ,y,X,X}) j y 



which corresponds to a Student's T distribution with (n + 2a) degrees of 
freedom, a location parameter equal to E[y|er 2 , y, X, X] [that does not depend 
on a] and a scale parameter equal to 

{2b + s 2 + 0- P) T (M" 1 + (X T X)~ 1 y 1 0-0)} 

x I m + X(M + X T X)- 1 X T ] /(n + 2a) . 



Exercise 3.7 Show that the marginal distribution of y associated with (3.3) 
and (3.4) is given by 

y\X~ ST n (2a,XP 7 -(I n + XM- 1 X T ] 
\ a 



This is a direct consequence of Exercise 3.6 when replacing (y,X) with 
(y, X) and (y, X) with the empty set. This is indeed equivalent to take m = n, 
n = 0, X = 0, ,s 2 = and 

- /3) T (M- 1 + (XTX)- 1 )- 1 0-p) = O 
in the previous exercice. 



Exercise 3.8 Given the null hypothesis Hq : R(3 = 0, where R is a (q,p) matrix 
of rank q, show that the restricted model on y given X can be represented as 

y\0o, Cq'Xo ~° ^Ki (XoPo, cr 2 I n ) 

where X is a (n, k — q) matrix and /3q is a (k — q) dimensional vector. (Hint: 
Give the form of X$ and /?o in terms of X and (3.) Under the hypothesis specific 
prior f3 a \H 0l a 2 ~ ,yV k _ q ((3 , a 2 (M )-^ and a 2 \H Q ~ J?&(a ,b ), construct 
the Bayes factor associated with the test of H . 
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When Rj3 = 0, (3 satisfies q independent linear constraints, which means 
that q coordinates of can be represented as linear combinations of the (k — q) 
others, denoted by /3o, e.g. 

Pii = s 7i A), ■ ■ • , Pi q = sJ q f3o , 
where i\ < • • • < i — q are the above coordinates. This implies that 

( Sj \ 

X(3 = X ■ • • A) = *oA) , 

where is either one of the above linear coefficients or contains except 
for a single 1. Therefore, when Hq holds, the expectation of y conditional 
on X can be written as X$(3q where Xq is a (n, k — q) matrix and [3q is a 
(k — q) dimensional vector. (Actually, there is an infinite number of ways to 
write E[y|X] in this format.) The change from a to Co is purely notational to 
indicate that the variance Cg is associated with another model. 



If we set a conjugate prior on (flo, (To), the result of Exercise 3.7 also applies 
for this (sub-)modcl, in the sense that the marginal distribution of y for this 
model is 

y|^o ~ ST n ^2a ,X /3o, ^(I„ + XoM^Xl)^ . 

Therefore, the Bayes factor for testing H can be written in closed form as 

r((2a + n)/2)/r(2a /2)/aV2^o(b /a Q ) n / 2 \I n + X M^X^\ 1 ' 2 
r((2a + n)/2)/r(2a/2)/aV^(b/a) n / 2 \I n + XM' 1 X T \ 1 / 2 

r -i -(2a +rt)/2 

[l + (y - X /3 ) T (I n + X Mo *o) _1 (y - X (3 )/2b j 



B, 



{l + (y - Xpy(I n + XM-iJfT)-i(y - X^)/26} 



-(2a+n)/2 



Note that, in this case, the normalising constants matter because they differ 
under H and under the alternative. 



Exercise 3.9 Show that 



c(s 2 + (j g _ 0)T X TX(0 - (3)) , X T X) -1 

n(c + l) 
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Since 

(3\a\y,X~JK k+1 ( ^—(P/c + (3), -^-(X 1 X)~ 



a 2 c 



c + 1 v ' ' ' c+1 
^-|y- -V - 1G ( |, y + 2^iy(^ - /3) T ^ T ^(/3 ~ /3) 



we have that 



with 

s 2 + r-TTT^ " /3) T ^ T ^(/3 ~ 0)1 /^ 2 ~ X 2 : 
(c+1) J 

which is the definition of the Student's 



V c+1 I c / n(c + 1) 

distribution. 



Exercise 3.10 Show that 7r(y|er 2 , y, X, X) is a Gaussian density. 



Conditional on a 2 , there is no difference with the setting of Exercise 3.6 
since the only difference in using Zellner's G-prior compared with the conju- 
gate priors is in the use of the noninformative prior ir{a 2 \X) cx o~ 2 . Therefore, 
this is a consequence of Exercise |3.6| 



Exercise 3.11 The posterior predictive distribution is obtained by integration 
over the marginal posterior distribution of a 2 . Derive Tt(y\y,X,X). 



Once more, integrating the normal distribution over the inverse gamma 
random variable a 2 produces a Student's & distribution. Since 

a 2 |y, X~igfe,^+ ^Y)0 ~ P) T X T X0 - 0) 
under Zellner's G-prior, the predictive distribution is a 

-I y y „ ( y /3+C/3 c{ S 2 +CP-PyX^X{P~P)/{c+l)) 

y\y,X,X ~ 3? k+1 \n,X—^-, 



x \l m + -^—X{X J X)- 1 X 1 

y c+i 



distribution. 
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Exercise 3.12 Give a joint (1 — a) HPD region on /3. 



Since we have (Exercise 3.9 1 



with 



c(s 2 +0- p) t x t x0 - $)) Tx ; 

n(c+l) / 
c(s 2 + 0- p) T X T X0 - $)) vT x 



n(c+l) 



-(x'xy 



an HPD region is of the form 

where k a is determined by the coverage probability a and 

c /^.^ ^ = c(s 2 + 0- P) T X T X0 - 0)) {x t x) -i 



c+ 1 \ c ' I n(c + 1) 

the distribution of (/3 — /i) T Z ,_1 (/3 — /t) is a Fisher's J 7 (k + 



3.5 



As in Exercise 



l,n) distribution. 



Exercise 3.13 Show that the matrix (7„ + cX(X T X) _1 X T ) has 1 and c + 1 
as eigenvalues. (Hint: Show that the eigenvectors associated with c+1 are of 
the form X/3 and that the eigenvectors associated with 1 are those orthogonal 
to X, i.e. z's such that X T z — 0.) Deduce that the determinant of the matrix 

(/„ + cX(X T X)- 1 X T ) is indeed (c + l)^ 1 )/ 2 . 



Given the hint, this is somehow obvious: 

(J„ + cX(X T X)- 1 X T )Xf3 = X/3 + cX(X T X)- 1 X T Xf3 = (c + 1)X(3 
(I n + cX(X T X)- 1 X T )z = z + cX(X T X)- 1 X J z = z 

for all /3's in M fe+1 and all z's orthogonal to X. Since the addition of those 
two subspaces generates a vector space of dimension n, this defines the whole 
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set of eigenvectors for both eigenvalues. And since the vector subspace gen- 
erated by X is of dimension (k + 1), this means that the determinant of 
(J„ + cX(X T X)- 1 X T ) is (c + l) fc+1 x l"-fe-i. 



Exercise 3.14 Derive the marginal posterior distribution of (3 for this model. 

The joint posterior is given by 

f3\a\y,X ~ Ji +1 (0,o 3 (X r X)- i ) , 
a 2 \y,X ~ S&((n-k-l)/2,s 2 /2). 

Therefore, 

P\y,X~% +l L-k-lJ, ^-^-j {X T X)- 1 ^ 
by the same argument as in the previous exercises. 



Exercise 3.15 Show that the marginal posterior distribution of /3j (1 < i < k) 

is a &i(n— k — 1, $i,u>u i)S 2 /{n — k — 1)) distribution. (Hint: Recall that wu ^ = 

The argument is straightforward: since (3\a 2 , y, X ~ t/^+i <7 2 (X T X) -1 ^ , 

A | cr 2 , y , JsT ~ ,jV (fa, a 2 Lo^ijj . Integrating out a 2 as in the previous exercise 
leads to 

Pi\a 2 ,y,X ~ £i(n- fc - I, faw^s 2 /(n- k - 1)) . 



Exercise 3.16 Give the predictive distribution of y, the m dimensional vector 
corresponding to the (m, fc) matrix of explanatory variables X. 



This predictive can be derived from Exercise 3.6 Indeed, Jeffreys' prior 
is nothing but a special case of conjugate prior with a = b = 0. Therefore, 
Exercise |3.6| implies that, in this limiting case, 



y|y, X, X ~ ST m (n, X(M + X T X)- 1 {X r XP + M(3), 



s 2 + 0~$) T (M- 1 + {X T X)- 1 ) \j3~j3) 
n 

x {/ m + X(M + X T X)- 1 X T ^ . 
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Exercise 3.17 When using the prior distribution 7r(c) = 1/c 2 , compare the 
results with Table 3.6. 



In the file #3.txt provided on the Webpage, it suffices to replace cc" (-1) 

with cc~(-2) : for instance, the point estimate of (3 is now 

> f acto=sum(cc/(cc+l)*cc~ (-2) * (cc+1) " (-11/2) * 
+ (t (y) °/„*%y-cc/ (cc+1) *t (y) °/,*°/.P'/.*°/.y) ~ (-33/2) ) / 
+ sum(cc~ (-2) * (cc+1) ~ (-11/2) * (t (y) °/.*°/.y-cc/ 

+ (cc+1) *t (y) /„*°/.P°/.*°/.y) * (-33/2) ) 

> f acto*betahat 







[.1] 


[1,] 


8 


506662193 


[2,] 


-0 


003426982 


[3,] 


-0 


041634562 


[4,] 





052547326 


[5,] 


-1 


000556061 


[6,] 





179158187 


[7,] 


-0 


275964816 


[8,] 


-0 


183669178 


[9,] 





140040003 


[10,] 


-0 


994120776 


[11,] 


-0 


334983109 



Exercise 3.18 Show that both series (3.10) and (3.11) converge. 



Given that 

r I -n/2 

y T y - -^ry T x(x T x)- 1 x T y 



/(y|I,c)a(c+l)-( t+1 >/ 2 



r 

« C - (fe+1)/2 [y T y - y t x(x^x)-'x^ y ] ~ n/2 

when c goes to oo, the main term in the series goes to as a o(c~( fe+3 )/ 2 ) and 
the series converges. 

Obviously, if the first series converges, then so does the second series. 



Exercise 3.19 Give the predictive distribution of y, the m-dimensional vector 
corresponding to the (to, k) matrix of explanatory variables X. 
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Fig. 3.1. Support of the uniform distribution. 



The predictive of y given y,X, X is then the weighted average of the 
predictives given y,X, X and c: 

oo 

7r(y|y, X, X) oc £ 7r(y|y, X, X, c)f(y\X, c) c" 1 

c=l 

where 7r(y|y, X, X, c) is the Student's & distribution obtained in Exercise 

Exercise 3.20 If {x\,X2) is distributed from the uniform distribution on 

{(»i,x a ); (xi-l) a + (a?a-l) 2 <l}u{(xi,xa); (n + l) 2 + (a* + l) 2 < 1} , 

show that the Gibbs sampler does not produce an irreducible chain. For this dis- 
tribution, find an alternative Gibbs sampler that works. (Hint: Consider a rotation 
of the coordinate axes.) 



The support of this uniform distribution is made of two disks with re- 
spective centers (—1,-1) and (1,1), and with radius 1. This support is not 
connected (see Figure |3.1| and conditioning on ii < means that the con- 
ditional distribution of X2 is ^(—1 — \J\ — x\, — 1 + \J\ — x\ , thus cannot 
produce a value in [0, 1]. Similarly, when simulating the next value of x\, it 
necessarily remains negative. The Gibbs sampler thus produces two types of 
chains, depending on whether or not it is started from the negative disk. If 
we now consider the Gibbs sampler for the new parameterisation 
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yi = x 1 + x 2 , V2 = x 2 -xi, 

conditioning on 31 produces a uniform distribution on the union of a negative 
and of a positive interval. Therefore, one iteration of the Gibbs sampler is 
sufficient to jump [with positive probability] from one disk to the other one. 



Exercise 3.21 If a joint density 3(2/1,2/2) corresponds to the conditional distri- 
butions c?i (2/1 12/2) and 32(2/212/1). show that it is given by 

, x £2(2/212/1) 

5(2/1,2/2) = 



/ 92(v\y 1 )/g 1 (y 1 \v) dv' 



If the joint density 3(2/1,2/2) exists, then 

3(2/1,2/2) = 3 1 (2/i)S'2(2/2|2/i) 
= .9 2 (2/2) 31 (2/1 1 2/2) 

where 3 1 and g 2 denote the densities of the marginal distributions of 2/1 and 
32, respectively. Thus, 

1/ x .9l(2/l|?/2) 2/ \ 

5 (2/1) = , , x fl (2/2 
52(2/2|2/i) 

3i(2/i 1 2/2) 

32(32 12/1) ' 

as a function of 31 [5^32) is irrelevant]. Since 3 1 is a density, 



and 



, 1(yi) = ^) / f^M du 

32(32131)/ 7 32(32!") 

5(2/1,2/2) = 31 (2/1I32)/ /^4^r4 du - 
/ 7 32(32!") 



Since 31 and 32 play symmetric roles in this derivation, the symmetric version 
also holds. 



Exercise 3.22 Check that the starting value of /j, in the setting of Example 
3.2 has no influence on the output of the above Gibbs sampler after N = 1000 
iterations. 



The core of the Gibbs program is 
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> mu = rnorm(l , sum(x*omega) /sum(omega+ . 05) , 
+ sqrt(l/(.05+2*sum(omega))) 

> omega = rexp(2, l+(x-mu) ~2) 

which needs to be iterated N = 1000 times to produce a Gibbs iV-sample from 
ir(fi\T>). A R program evaluating the [lack of] influence of the starting value 
^(°) will thus need to compare histograms of ^,( 1000 )'s for different starting 
values. It therefore requires three loops: 

x=c(3. 2,-1 . 5) # observations 
mu0=seq(-10,10,length=20) # starting values 
muk=rep (0,250) 

par(mfrow=c(5,4) ,mar=c(4, 2 ,4, 1) ) # multiple histograms 
for (i in 1:20){ 
for (t in 1:250){ 
mu=mu0 [i] 

for (iter in 1:1000){ 

omega = rexp(2, l+(x-mu) ~2) 
mu = rnorm(l,sum(x*omega)/sum(omega+.05) , 
sqrt (l/( . 05+2*sum (omega) ) ) ) 

} 

muk [t] =mu 

} 

hist (muk,proba=T, col="wheat" ,main=paste(mu0 [i] ) ) 



} 

Be warned that the use of this triple loop induces a long wait on most ma- 
chines! 



Exercise 3.23 In the setup of Section 3.5.3, show that 




CO 

7r( 7 |y,X) oc X>- X (c+ l)-^' +1)/2 [y T y- 

c=l 






-n/2 




and that the series converges. If 7r(c) cx c a , find which values of a lead to a 


proper posterior. 
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We can take advantage of Section 3.5.2: when c is fixed in Zellner's infor- 
mative G-prior and /? 7 = 9 _ +1 for all 7's, 



7r( 7 |y,X,c)oc( C +l)-^ +1 )/ 2 
thus 

7r( 7 ,c|y,X,c) oc ^{c+l)-^ 1 ^ 2 
and 

00 



1 -™/2 



V 2 



c=l 

00 



yTy ~ ^TT yT ^ (^t) -1 ^ 7 T y 



a^c-^c+l)"^ 1 )/ 2 

c=l 

For 7r(c) cx c~ q , the series 
00 

Je-^+l)-^ y T y - ^y 1 * 7 (X^)" 1 * 7 T y 



-n/2 



c=l 



i/2 



converges if and only if, for all 7's, 

a+ ^_>1, 

which is equivalent to 2a + g 7 > 1. Since min(g , 7 ) = 0, the constraint for 
propriety of the posterior is 

a>l/2. 
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Exercise 4.1 For bank, derive the maximum likelihood estimates of Po and Pi 
found in the previous analysis. Using Jeffreys prior on the parameters (Po , Pi , c 2 ) 
of the linear regression model, compute the corresponding posterior expectation 

of (A>, ft). 




The code is provided in the file #4.txt on the Webpage. If the bank dataset 
is not available, it can be downloaded from the Webpage and the following 
code can be used: 



bank=matrix( scan ("bank") ,byrow=T,ncol=5) 
y=as . vector (bank [ , 5] ) 

X=cbind(rep(l ,200) , as. vector (bank [,1] ) , as. vector (bank [, 2] ) , 

as. vector (bank [, 3] ) , as. vector (bank [, 4] )) 
summary (lm(y~X[, 5] )) 

which produces the output [leading to eqn. (4.1) in the book]: 

> summary(lm(y~X[,5] )) 

Call: 

lm(formula = y ~ X[, 5]) 
Residuals : 

Min 1Q Median 3Q Max 

-0.76320 -0.21860 -0.06228 0.18322 1.04046 



Coefficients : 

Estimate Std. Error t value Pr(>|t|) 
(Intercept) -2.02282 0.14932 -13.55 <2e-16 *** 
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X[, 5] 0.26789 0.01567 17.09 <2e-16 *** 

Sig. codes: <***> .001 '**' .01 .05 '.' .1 ' ' 1 

Residual standard error: 0.3194 on 198 degrees of freedom 
Multiple R-Squared: 0.596, Adjusted R-squared: 0.594 

F-statistic: 292.2 on 1 and 198 DF, p-value : < 2.2e-16 



As shown in Exercise 3.14 the [marginal] posterior in (3 associated with 
the Jeffreys prior is 



T 



0\y, X ~ ? k+1 { n - k - 1, (3, — j^(X ' X) 



so the posterior expectation of (/3o,/?i) is again (3. 



Exercise 4.2 Show that, in the setting of Example 4.1, the statistic Y27=i J/i xI 
is sufficient when conditioning on the x l 's (1 < i < n) and give the corresponding 
family of conjugate priors. 



Since the likelihood is 

exp J Vi 1 / ft [1 + exp(x* T /?)] 

I i=l J ' i=l 

= exp I [Vi x *] T P \ I ft [1 + ex P (x jT /3)] , 
I i=i J ' i=i 

it depends on the observations (yx, . . . , y n ) only through the sum Y^i=i Vi x * 
which is thus a sufficient statistic in this conditional sense. 
The family of priors ((eI t ,A>0) 

/ - 

A) ex exp {£ T /3} / JJ [1 + cxp(x iT /3)] A , 
' i=i 

is obviously conjugate. The corresponding posterior is 



TT [0 



£ + ^y,x l ,A + i I - 



»=i 



whose drawback is to have A updated in A + 1 rather than A + n as in other 
conjugate settings. This is due to the fact that the prior is itself conditional 
on X and therefore on n. 
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Exercise 4.3 Show that the logarithmic link is the canonical link function in 
the case of the Poisson regression model. 



The likelihood of the Poisson regression model is 

t(0\y,X) = f[ (J-^j exp{ 2/l x jT /3-exp(x lT / 3)} 

n 1 

= I| — f exp {yi log(/Xj) - m} , 

Hi- 



i=l 



so log(/Uj) = x lT /3 and the logarithmic link is indeed the canonical link func- 
tion. 



Exercise 4.4 Suppose yi,--.,yk ar e independent Poisson V(fJ,i) random vari- 
ables. Show that, conditional on ra = J2i=i Vi< 

y = (Vi,---,yk) ~ Mk(n; <*i,..., a k ) 
and determine the a/s. 



The joint distribution of y is 



while n = J2i=iVi ~ ^(SiLiMi) [which can be established using the mo- 
ment generating function of the V(/j) distribution]. Therefore, the conditional 
distribution of y given n is 

n-=i(S)exp{-E-=i^} /* \ 



n 



Mi 



n^lJ/i' i=l \%2i=l Pi 



EH- 



which is the pdf of the A4k(n; a\, . . . , o^) distribution, with 

Mi 



i = 1, . . . , fc . 
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This conditional representation is a standard property used in the sta- 
tistical analysis of contingency tables (Section 4.5): when the margins are 
random, the cells are Poisson while, when the margins are fixed, the cells are 
multinomial. 



Exercise 4.5 Show that the detailed balance equation also holds for the Boltz- 
mann acceptance probability 

P(x, y) ~ 



n(y)q(y,x) +ir(x)q(x,y) 



The detailed balance equation is 

n(x)q(x,y)p(x,y) = n(y)q(y,x)p(y,x) . 
Therefore, in the case of the Boltzmann acceptance probability 

n{y)q{y,x) 



Tr(x)q(x,y)p(x,y) = n(x)q(x,y) 



n{y)q{y,x) + n(x)q{x,y) 
n(x)q(x,y)n(y)q(y,x) 



K{y)q{y,x) + Tt{x)q{x,y) 

n(x)q(x,y) 



= n{y)q{y,x) 



n(y)q(y,x) + 'K(x)q(x,y) 
= n(y)q(y,x)p(y,x) . 

Note that this property also holds for the generalized Boltzmann acceptance 
probability 

p ( xy j = n{y)q{y,x)a{x,y) 



n(y)q(y, x)a(x, y) + n(x)q(x, y)a(y, x) 
where a(x,y) is an arbitrary positive function. 



Exercise 4.6 For ir the density of an inverse normal distribution with parameters 

6>i = 3/2 and 9 2 = 2, 

ir(x) oc x" 3/2 exp(-3/2a; - 2/x)I x>0 , 

write down and implement an independence MH sampler with a Gamma proposal 
with parameters (a,/3) = (4/3,1) and (a,0) = (0.5^/173,0.5). 



A R possible code for running an independence Metropolis-Hastings sam- 
pler in this setting is as follows: 
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# target density 

target=f unction (x, the 1=1 . 5 ,the2=2) { 
x~ (-thel)*exp(-thel*x-the2/x) 
} 

al=4/3 
bet=l 

# initial value 
mcmc=rep(l , 1000) 

for (t in 2:1000){ 

y = rgamma(l,shape=al,rate=bet) 

if (runif (l)<target (y) *dgamma(mcmc [t-1] , shape=al ,rate=bet) / 
(target (mcmc [t-1] ) *dgamma(y , shape=al ,rate=bet) ) ) 
mcmc [t] =y 
else 

mcmc [t] =mcmc [t-1] 

} 

# plots 

par (mf row=c (2, 1) ,mar=c(4, 2 ,2, 1) ) 

res=hist (mcmc , f req=F , nclass=55 , prob=T , col= " grey56 " , 

ylab="",main="") 
lines (seq(0 . 01 ,4, length=500) , valpi*max(res$int) /max(valpi) , 

lwd=2 , col= " sienna2 " ) 
plot (mcmc, type="l" , col="steelblue2" ,lwd=2) 

The output of this code is illustrated on Figure |4.1| and shows a reasonable 
fit of the target by the histogram and a proper mixing behaviour. Out of the 
1000 iterations in this example, 600 corresponded to an acceptance of the 
Gamma random variable. (Note that to plot the density on the same scale as 
the histogram, we resorted to a trick on the maxima of the histogram and of 
the density.) 

Exercise 4.7 Estimate the mean of a Sfa(4.3,6.2) random variable using 

1. direct sampling from the distribution via R command 

> x=rgamma(n,4.3,rate=6.2) 

2. Metropolis-Hastings with a Sfa(4, 7) proposal distribution; 

3. Metropolis-Hastings with a ^a(5,6) proposal distribution. 

In each case, monitor the convergence of the cumulated average. 



44 4 Generalized Linear Models 




o 200 400 600 aoo 1000 

Index 

Fig. 4.1. Output of an MCMC simulation of the inverse normal distribution. 

Both independence Metropolis-Hastings samplers can be implemented via 
an R code like 

al=4.3 
bet=6.2 

mcmc=rep(l , 1000) 
for (t in 2:1000){ 

mcmc [,t] =mcmc [ ,t-l] 

y = rgamma(500,4,rate=7) 

if (runif(l)< dgamma(y,al,rate=bet)*dgamma(mcmc [t-1] ,4,rate=7)/ 
(dgamma(mcmc [t-1] , al,rate=bet)*dgamma(y ,4,rate=7) )){ 
mcmc [t] =y 
} 

} 

aver=cumsvrm (mcmc) / 1 : 1000 

When comparing those samplers, their variability can only be evaluated 
through repeated calls to the above code, in order to produce a range of out- 
puts for the three methods. For instance, one can define a matrix of cumulated 
averages aver=matrix (0,250, 1000) and take the range of the cumulated av- 
erages over the 250 repetitions as in ranj=apply (aver , 1 .range) , leading to 
something similar to Figure |4.2| The complete code for one of the ranges is 

al=4.3 
bet=6.2 
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mcmc=matrix ( 1 , ncol=1000 , nrow=500) 
for (t in 2: 1000) { 
mcmc [ ,t] =mcmc [ ,t-l] 
y = rgamma(500,4,rate=7) 
valid= (runif (500) <dgamma(y , al ,rate=bet) * 

dgamma (mcmc [i ,t-l] ,4,rate=7)/(dgamma(mcmc[,t-l] ,al,rate=bet)* 
dgamma(y ,4,rate=7) ) ) 
mcmc [valid, t] =y [valid] 
} 

aver2=apply (mcmc , 1 , cumsum) 
aver2=t (aver2/ (1 : 1000) ) 
ranj 2=apply (aver2 , 2 , range) 

plot (ranj 2 [1 , ] , type= " 1" , ylim=range (ran j 2) , ylab= " " ) 
polygon(c(l : 1000,1000: 1) ,c (ranj 2 [2,] , rev (ranj 2 [1 ,] ) ) ) 

which removes the Monte Carlo loop over the 500 replications by running the 
simulations in parallel. We can notice on Figure [472] that, while the output 
from the third sampler is quite similar with the output from the iid sampler 
[since we use the same scale on the y axis] , the Metropolis-Hastings algorithm 
based on the 5fa(4, 7) proposal is rather biased, which may indicate a difficulty 
in converging to the stationary distribution. This is somehow an expected 
problem, in the sense that the ratio target-over-proposal is proportional to 
x 3 exp(0.8a;), which is explosive at both x — and x = oo. 




200 400 SOO SOO O 200 4DD 600 800 O 200 400 BOO BOO 



Fig. 4.2. Range of three samplers for the approximation of the ^a(4.3, 6.2) mean: 
(left) iid; (center) If a(4, 7) proposal; (right) <fa(5,6) proposal. 



46 4 Generalized Linear Models 



Exercise 4.8 Consider x\, X2 and X3 iid ^(6,1), and ir(8) oc exp(— 9 2 /l00). 
Show that the posterior distribution of 8, ir(8\xi,X2,X3), is proportional to 

exp(-0 2 /l()O)[(l + (0 - Xl )2)(l + (0 _ X2 )2)(l + (0 _ -,3)2)]-! 

and that it is trimodal when x\ — 0, X2 — 5 and X3 = 9. Using a random walk 
based on the Cauchy distribution "^(0, a 2 ), estimate the posterior mean of 8 using 
different values of a 2 . In each case, monitor the convergence. 



The function (4.1) appears as the product of the prior by the three densities 
f(xi\8). The trimodality of the posterior can be checked on a graph when 



plotting the function (4.1 



A random walk Metropolis-Hastings algorithm can be coded as follows 

x=c(0,5,9) 
# target 

targ=f unction(y) { 

dnorm(y , sd=sqrt (50) ) *dt (y-x [1] , df =1) * 
dt(y-x[2] ,df=l)*dt(y-x[3] ,df=l) 

} 



# Checking trimodality 
plot(seq(-2, 15,length=250) , 

targ(seq(-2 , 15 , length=250) ) , type="l" ) 

sigma=c( . 001 , . 05 , 1) *9 # different scales 
N=100000 # number of mcmc iterations 



mcmc=matr ix (mean (x) , ncol=3 , nrow=N) 
for (t in 2:N){ 



mcmc [t , ] =mcmc [t- 1 , ] 

y=mcmc [t ,] +sigma*rt (3, 1) # rnorm(3) 

valid=(runif (3)<targ(y) /targ(mcmc [t-1 ,] ) ) 

mcmc [t , valid] =y [valid] 

} 

The comparison of the three cumulated averages is given in Figure |4.3| and 
shows that, for the Cauchy noise, both large scales are acceptable while the 
smallest scale slows down the convergence properties of the chain. For the 
normal noise, these features are exacerbated in the sense that the smallest 
scale does not produce convergence for the number of iterations under study 
[the blue curve leaves the window of observation] , the medium scale induces 
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some variability and it is only the largest scale that gives an acceptable ap- 
proximation to the mean of the distribution (4.1 1. 




Oe-t-OO 4e+04 8e+04 Oe+OO 4e+04 8e+04 



iterations iterations 

Fig. 4.3. Comparison of the three scale factors a = .009 (blue), a = .45 (gold) and 
a = 9 (brown), when using a Cauchy noise (left) and a normal noise (right). 



Exercise 4.9 Rerun the experiment of Example 4.4 using instead a mixture of 
five random walks with variances a — 0.01, 0.1, 1, 10, 100, and equal weights, and 
compare its output with the output of Figure 4.3. 



The original code is provided in the files #4. Rand #4. txt on the webpage. 
The modification of the hml function is as follows: 

hmi=f unction (n, xO , sigma2) 
{ 

x=rep(xO,n) 
for (i in 2:n){ 
x[i]=x[i-l] 

y=rnorm(l ,x [i-1] , sqrt (sample (sigma2, 1) ) ) 
if (runif (l)<dnorm(y) /dnorm(x [i-1] ) ) 
x[i]=y 

} 

x 
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Note that picking the variance at random does not modify the random walk 
structure of the proposal, which is then a mixture of normal distributions all 
centered in x^ x \ The output compares with Figure 4.3 [from the book] but 
the histogram is not as smooth and the autocorrelations are higher, which 
can be easily explained by the fact that using a whole range of scales induces 
inefficiencies in that the poor scales arc chosen for "nothing" from time to 
time. 



Fig. 4.4. Simulation of a ,yf(0, 1) target with a normal mixture: top: sequence of 
10, 000 iterations subsampled at every 10-th iteration; middle: histogram of the 2, 000 
last iterations compared with the target density; bottom: empirical autocorrelations 
using R function plot.acf. 



Exercise 4.10 Find conditions on the observed pairs (x 1 ,^) for the posterior 
distribution above to be proper. 



This distribution is proper (i.e. well-defined) if the integral 

/n 
~[[<P(x lT f3) Vi [1 - <2>(x tT /3)] l ~ Vi d/3 
»=1 

is finite. If we introduce the latent variable behind <£(x lT (3), we get by Fubini 
that 



3 = / fl<p{zi) I 

J 7-1 Jt/3;x'T 



d/3 dzi • • • dz n , 



{/3;x iT /3)^ Zi , i=l,...,n} 
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where x lT /3 ^ Zi means that the inequality is x lT /3 < Zi if j/j = 1 and x lT /3 < 
otherwise. Therefore, the inner integral is finite if and only if the set 



is compact. The fact that the whole integral 3 is finite follows from the fact 
that the volume of the polyhedron defined by ^ grows like \zi\ k when Zi goes to 
infinity. This is however a rather less than explicit constraint on the (x% j/j)'s! 

Exercise 4.11 Include an intercept in the probit analysis of bank and run the 
corresponding version of Algorithm 4.2 to discuss whether or not the posterior 
variance of the intercept is high 

We simply need to add a column of l's to the matrix X, as for instance in 

> X=as.matrix(cbind(rep(l,dim(X) [1] ) ,X)) 

and then use the code provided in the file #4.txt, i.e. 

f latprobit=hmf latprobit (10000 , y , X , 1) 
par (mf row=c (5 , 3) ,mar=l+c (1 .5,1.5,1.5,1.5)) 
for (i in 1:5){ 
plot (f latprobit [ , i] ,type="l" ,xlab=" Iterations" , 

ylab=expression(beta[i] ) ) 
hist (f latprobit [1001 : 10000 , i] ,nclass=50 ,prob=T,main=" " , 

xlab=expression(beta[i] ) ) 
acf (f latprobit [1001 : 10000 , i] , lag=1000 ,main=" " , 
ylab=" Autocorrelation" ,ci=F) 

} 

which produces the analysis of bank with an intercept factor. Figure [4~5| gives 
the equivalent to Figure 4.4 [in the book]. The intercept (5q has a posterior 
variance equal to 7558.3, but this must be put in perspective in that the 
covariates of bank are taking their values in the magnitude of 100 for the 
three first covariates and of 10 for the last covariate. The covariance of 
is therefore of order 7000 as well. A noticeable difference with Figure 4.4 [in the 
book] is that, with the inclusion of the intercept, the range of j3\S supported 
by the posterior is now negative. 

Exercise 4.12 Using the latent variable representation of Example 4.2, intro- 
duce Zi\(3 ~ jY (x iT /3, l) (1 < i < n) such that y t = M Zi < . Deduce that 



%={0;x ir 0^z i , i 



1, . . . ,n} 




(4.2) 
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Fig. 4.5. bank: estimation of the probit coefficients [including one intercept /3o] via 
Algorithm 4.2 and a flat prior. Left: fit's (i — 0, ... ,4); center: histogram over the 
last 9, 000 iterations; right: auto-correlation over the last 9, 000 iterations. 



where jY+ (/j,, 1, 0) and jY- (fi, 1,0) are the normal distributions with mean fj, and 
variance 1 that are left-truncated and right-truncated at 0, respectively. Check 
that those distributions can be simulated using the R commands 

> xp=qnorm(runif (l)*pnorm(mu)+pnorm(-mu))+mu 

> xm=qnorm(runif (l)*pnorm(-mu))+mu 

Under the flat prior 7r(/?) oc 1, show that 

/3|y,z ~ ,jV k ((X T X)- 1 X T z, (X T X)- 1 ) , 
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where z = (zi,...,z n ) and derive the corresponding Gibbs sampler, sometimes 
called the Albert-Chib sampler. (Hint: A good starting point is the maximum 
likelihood estimate of f3.) Compare the application to bank with the output in 
Figure 4.4. (Note: Account for computing time differences.) 



If Zi\j3 ~ jV (x lT (3, l) is a latent [unobserved] variable, it can be related 
to i/i via the function 

since P(y t = 1) = P(z l > 0) = 1 - <P (-x lT /3) = <P (x lT /3). The conditional 
distribution of Zi given yi is then a constrained normal distribution: if j/j = 1, 
zi < and therefore 

z i \y i = l,/3^^ + (^ T p,l,0) . 

(The symmetric case is obvious.) 

The command qnorm(runif (1) *pnorm(mu)+pnorm(-mu) )+mu is a simple 
application of the inverse cdf transform principle given in Exercise |2.28| the 
cdf of the yl+ (/i, 1, 0) distribution is 

_ <P(x - M ) - 

If we condition on both z and y [the conjunction of which is defined as the 
"completed model"], the y^s get irrelevant and we are back to a linear regres- 
sion model, for which the posterior distribution under a flat prior is given in 
Section 3.3.1 and is indeed jV k ((X T X)- 1 X T z, (X T X)- 1 ). 

This closed- form representation justifies the introduction of the latent vari- 
able z in the simulation process and leads to the Gibbs sampler that simulates 



(3 given z and z given (3 and y as in (4.2 ). The R code of this sampler is avail 



able in the file #4.R as the function gibbsprobit. The output of this function 



is represented on Figure 4.6 Note that the output is somehow smoother than 
on Figure 4.5 (This does not mean that the Gibbs sampler is converging faster 
but rather than its component-wise modification of the Markov chain induces 
slow moves and smooth transitions.) 

When comparing the computing times, the increase due to the simulation 
of the Zi's is not noticeable: for the bank dataset, using the codes provided 
in #4.txt require 27s and 26s over 10,000 iterations for hmflatprobit and 
gibbsprobit. respectively. 



Exercise 4.13 Find conditions on YliVi an ^ on — Vi) f° r tne posterior 

distribution defined by (4.3) to be proper. 
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Fig. 4.6. bank: estimation of the probit coefficients [including one intercept /3o] by 
a Gibbs sampler 4.2 under a flat prior. Left: /Vs (i = 0, ... ,4); center: histogram 
over the last 9, 000 iterations; right: auto-correlation over the last 9, 000 iterations. 



There is little difference with Exercise 14.101 because the additional term 
(/3 T (X 1 X) (3) ^ 2k is creating a problem only when [3 goes to 0. This dif- 
ficulty is however superficial since the power in ||X/3||( 2fe_1 )/ 2 is small enough 
to be controlled by the power in ||X/3|| fc_1 in an appropriate polar change of 
variables. Nonetheless, this is the main reason why we need a 7r(c 2 ) oc cr~ 3 / 2 
prior rather than the traditional 7r(er 2 ) oc cr~ 2 which is not controlled in (3 = 0. 
(This is the limiting case, in the sense that the posterior is well-defined for 
7r(cr 2 ) cx a~ 2+e for all e > 0.) 
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Exercise 4.14 For bank, compute the Bayes factor associated with the null 
hypothesis Hq : {3 2 — = 0. 



The Bayes factor is given by 

7r- fc / 2 r((2fc- l)/4) 
01 ~ 7r-( fe - 2 )/ 2 r{(2fc-5)/4} 

J (/3 T (X T X)/3)~ (2fc ~ 1)/4 nr =1 <Z>(x' T /?)^ [1 -Sfr'T/?)] 1 -* d/3 

x /{(/? o ) T (x T xo)/3o}" (2 ^ 5)/4 nr=i f(4 T /3 o ) yi [i-#(4 T ^)] 1 " , "d^ ' 

For its approximation, we can use simulation from a multivariate normal as 
suggested in the book or even better from a multivariate a direct adapta- 
tion from the code in #4.txt is 

noinf probit=hmnoinf probit (10000 ,y ,X, 1) 
library (mnormt) 

mkprob=apply (noinf probit , 2 , mean) 
vkprob=var (noinf probit) 
simk=rmvnorm( 100000 .mkprob , 2*vkprob) 
usk=probitnoinf lpost (simk.y ,X) - 

dmnorm(simk , mkprob , 2*vkprob , log=TRUE) 

noinf probit0=hmnoinf probit (10000 ,y,X[,c(l,4)],l) 
mk0=apply (noinf probitO , 2 , mean) 
vk0=var (noinf probitO) 
simk0=rmvnorm( 100000, mk0,2*vk0) 
usk0=probitnoinf lpost (simkO ,y,X[,c(l,4)])- 

dmnorm(simk0 ,mk0 , 2*vk0 , log=TRUE) 
bf 0probit=mean(exp(usk) )/mean(exp(usk0) ) 

(If a multivariate & is used, the dmnorm function must be replaced with 
the density of the multivariate 3?.) The value contained in bfOprobit is 
67.74, which is thus an approximation to B\q [since we divide the approx- 
imate marginal under the full model with the approximate marginal under 
the restricted model]. Therefore, Hq is quite unlikely to hold, even though, in- 
dependently, the Bayes factors associated with the componentwise hypotheses 
Hq : 02 = and Hq : /3 3 = support those hypotheses. 



Exercise 4.15 Compute the Jacobian \dpi ■ ■ ■ dpk/d(3\ ■ ■ ■ d(ik\ and deduce 
that the transform of the prior density 7r(pi, . . . ,Pk) in the prior density above is 
correct. 
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Since pi = <P(ic lT {3), we have 

which means that the Jacobian Z is the determinant of the matrix made of X 
multiplied by ip(ic lT (3) on each row. Therefore, 

k 

a = n<K* iT /3)pq- 

(Note that this is a very special case where X is a square matrix, hence \X\ 
is well-defined.) Since \X\ does not depend on (3, it does need to appear in 
tt(/3), i.e. 

k 

7r(/j) oc n m {r 0) Kt81 - 1 [i - ^^)] Kiii ~ Bi) ~ i ■ 

i=l 



Exercise 4.16 In the case of the logit model, i.e. when pi = expx* T /3/{l + 
cxpx lT /3} (1 < i < k), derive the prior distribution on (3 associated with the 
prior (4.5) on (pi,... ,p k )- 



The only difference with Exercise 4.15 is in the use of a logistic density, 
hence both the Jacobian and the probabilities are modified: 



r(/3) ex Yl 



exp({^ gi - l}x' T /3) exp(x* T /3) 
=\ {1 + exp(x* T /3)} Kl ~ 2 {1 + cxp(x iT /3)} 2 



exp K tgi Z lT l3 



[] {l + exp^)}^ 



i=l 



Exercise 4.17 Examine whether or not the sufficient conditions for propriety of 
the posterior distribution found in Exercise 4.13 for the probit model are the same 
for the logit model. 



There is little difference with Exercises |4.10| and |4.13| because the only 
change is [again] in the use of a logistic density, which has asymptotics similar 
to the normal density. The problem at (5 — is solved in the same manner. 
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Exercise 4.18 For bank and the logit model, compute the Bayes factor associ- 
ated with the null hypothesis Hq : fa = fa = and compare its value with the 



value obtained for the probit model in Exercise 4.14 



This is very similar to Exercise |4.14[ except that the parameters are now 
estimated for the logit model. The code is provided in file #4 . txt as 

# noninf ormative prior and random walk HM sample 
noinf logit=hmnoinf logit (10000 , y , X , 1) 

# log-marginal under full model 
mklog=apply (noinf logit , 2 , mean) 
vklog=var (noinf logit) 
simk=rmnorm ( 100000 ,mklog, 2*vklog) 
usk=logitnoinf lpost (simk , y , X) - 
dmnorm (simk , mklog , 2*vklog , log=TRUE) 

# noninf ormative prior and random walk HM sample 

# for restricted model 

noinf logit0=hmnoinf logit ( 10000 ,y,X[,c(l,4)],l) 

# log-marginal under restricted model 
mk0=apply (noinf logitO ,2 ,mean) 
vk0=var (noinf logitO) 
simk0=rmnorm( 100000 ,mk0 , 2*vk0) 
usk0=logitnoinf lpost (simkO ,y ,X [,c(l,4)])- 
dmnorm(simk0 ,mk0 , 2*vk0 , log=TRUE) 

bf 01ogit=mean(exp(usk) )/mean(exp(usk0) ) 

The value of bf Ologit is 127.2, which, as an approximation to B* , argues 
rather strongly against the null hypothesis Hq. It thus leads to the same con- 
clusion as in the probit model of Exercise |4.14| except that the numerical value 
is almost twice as large. Note that, once again, the Bayes factors associated 
with the componentwise hypotheses Hq : fa — and Hq : fa — support 
those hypotheses. 



Exercise 4.19 In the case of a 2 x 2 contingency table with fixed total count 
n — nn + ni2 + n^i + n-22, we denote by On, 612, O21, O22 the corresponding 
probabilities. If the prior on those probabilities is a Dirichlet 2?4(l/2, . . . , 1/2), 
give the corresponding marginal distributions of a — 6\i + Q12 and of /? — On + 
621- Deduce the associated Bayes factor if H is the hypothesis of independence 
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between the factors and if the priors on the margin probabilities a and (3 are those 
derived above. 



A very handy representation of the Dirichlet T>k(5i, . . . , 5k) distribution is 

(a, ...,&) 



& + •■•+&) 

Therefore, if 



V k (5i,...,Sk) when & ~ #a(5j, 1) , i = 1, . . . , A; . 



then 



and 



a /)/)/) \ (£ll, £l2, £21, £22) t iid „, , , . 

?n, 0i2, 6>2i, 022) = t — — —z — ~ S?a(l/2,1) 

4ll + 4l2 + 421 + 422 

(4n + 4i2, 42i + C22) 



(011 + 012, 021 + 022) 



4ll + £l2 + 421 + 422 



iid 



(£11 +£12), (421 + 422) ~Sfo(l,l) 

implies that a is a £8e(l, 1) random variable, that is, a uniform ^(01,) vari- 
able. The same applies to [3. (Note that a and are dependent in this repre- 
sentation.) 

Since the likelihood under the full model is multinomial, 



t(0\T)=l 0^0^0 2 T0 2 "f) 
\nn n\2 n,2i/ 

where T denotes the contingency table [or the dataset {tin, ni2, ri2i, ^22}], 
the [full model] marginal is 



m(T) = 



n 

niini 2 n 2 i 

7T 2 

n 



£7111-1/2 ^nis-1/2 ^"21-1/2 /)n 2 2-l/2 ,/i 
tf n f 12 tf 21 W 22 dtf 



1/2) 



7T 

n 



r(n + 2) 

n%+v2) 



7T~ 



(n + 1)! 
r(«y + 1/2) 



1 n nmj + 1/2 



(n + 



where the 7r 2 term comes from 7^(1/2) = y/n. 

In the restricted model, 0n is replaced with a/3, 0i2 by a(l — /3), and so 
on. Therefore, the likelihood under the restricted model is the product 
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f n ja Tll (l-a) n - ni x QM/3 ni (l-/3) n - ni , 

where ri\. = nn + riyi and n.\ — ri\\ + n%i, and the restricted marginal under 
uniform priors on both a and [3 is 

m Q (T) = Q 1 ) (j 1 ) j[ « ni '(! - a)""" 1 da jf /3 n i (l - 1 d/3 

_ / n \ ( n \ (m. + l)!(n - ni. + 1)! (n. x + l)!(n - n. x + 1)! 

Ui / U J ~ (n + 2)! (n + 2)! 

_ (ni. + l)(n - n x . + 1) (n.x + l)(n - n.x + 1) 
(n + 2)(n+ 1) (n + 2)(n+ 1) 

The Bayes factor is then the ratio mo(T)/m(T). 



Exercise 4.20 Given a contingency table with four categorical variables, deter- 
mine the number of submodels to consider. 



Note that the numbers of classes for the different variables do not matter 
since, when building a non-saturated submodel, a variable is in or out. There 
are 

1. 2 4 single-factor models [including the zero- factor model]; 

2. (2 6 — 1) two-factor models [since there are ( 2 ) = 6 ways of picking a pair 
of variables out of 4 and since the complete single-factor model is already 
treated] ; 

3. (2 4 — 1) three-factor models. 

Thus, if we exclude the saturated model, there are 2 6 + 2 5 — 2 = 94 different 
submodels. 



Exercise 4.21 Find sufficient conditions on (y, X) for this posterior distribution 
proportional to be proper. 



the term (fi T {X T X)0) 



-(2fe-l)/4 



is not a major 



First, as in Exercise 4.13 
problem when (3 goes to 0, since it is controlled by the power in the Jacobian 
||X/3|| fc_1 in an adapted polar change of variables. Moreover, if the matrix X 
of regressors is of full rank k, then, when f3j (J = 1, . . . , k) goes to ±oo, there 
exists at least one 1 < i < n such that x lT /? goes to either +oo or — oo. In the 
former case, the whole exponential term goes to 0, while, in the later case, it 
depends on yi. For instance, if all y/s are equal to 1, the above quantity is 
integrable. 



5 



Capture— Recapture Experiments 



Exercise 5.1 Show that the posterior distribution n(N\n + ) given by (5.1) while 
associated with an improper prior, is defined for all values of n + . Show that the 
normalization factor of (5.1) is n + V 1 and deduce that the posterior median 
is equal to 2(n + V 1). Discuss the relevance of this estimator and show that it 
corresponds to a Bayes estimate of p equal to 1/2. 



Since the main term of the series is equivalent to TV 2 , the series converges. 
The posterior distribution can thus be normalised. Moreover, 

OO OO /-| -| 

V = T -- — 

^ i(i+ 1) ^ V i i + 1 

i— no i-no x 

1111 

n n + l n + l n Q + 2 
1 

n ' 

Therefore, the normalisation factor is available in closed form and is equal to 
n + V 1. The posterior median is the value N* such that 7r(iV > N*\n + ) = 1/2, 
i.e. 



OO 

V — 



1 1 



v< + 2 n+Vl 

1 

which implies that N* = 2(n + VI). This estimator is rather intuitive in that 
E[n + \N,p] = pN: since the expectation of p is 1/2, E[n+|A/] = N/2 and 
N* = 2n + is a moment estimator of N. 
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Exercise 5.2 Under the prior n(N,p) <x N 1 , derive the marginal posterior 
density of N in the case where ~ £§{N,p) and where k — 1 iid observations 

n+,. . . ,n+ ~ ^(nf,p) 
are observed (the later are in fact recaptures). Apply to the sample 

(n+ n+ ...,n+) = (32,20,8,5,1,2,0,2,1,1,0), 
which describes a series of tag recoveries over 11 years. 



In that case, if we denote n+ = n+ + • • • + the total number of captures, 
the marginal posterior density of N is 

TV' 

7r(JV|n+...,n+)oc — _ n + y N l N>n+ 



f 1 

/ p n t+-+ n t (1 - p)JV-n+ + (ni+-n++---+n+-n+ ^ 

Jo 



p" + (l-p) JV + fe ^-™ + dp 



(N-n+y. N ^ J0 
(N-l)\ (N + knf-n+y. 
K (7V-n+)! (7V + fcn+ + l)! ^><vi . 

which does not simplify any further. Note that the binomial coefficients 



(j > 2) 



are irrelevant for the posterior of N since they only depend on the data. 
The R code corresponding to this model is as follows: 

nl=32 

ndo=sum(32 ,20, 8, 5, 1,2, 0,2, 1,1,0) 

# unnormalised posterior 
post=f unction(N) { 

exp(lf actor ial(N-l)+lf actorial (N+ll*nl-ndo)- 
lf actorial (N-nl) -If actorial (N+ll*nl+l) ) 

} 



# normalising constant and 

# posterior mean 
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posv=post ( (nl : 10000) ) 
cons=sum(posv) 

pmean=sum( (nl : 10000) *posv) /cons 
pmedi=sum(cumsum(posv)< . 5*cons) 

The posterior mean is therefore equal to 282.4, while the posterior median is 
243. Note that a crude analysis estimating p by p — (n 2 +. ■ . + nn)/(10n ] f ) = 
0.125 and N by n\ jp would produce the value N = 256. 



Exercise 5.3 For the two-stage capture-recapture model, show that the distri- 
bution of m 2 conditional on both samples sizes n\ and n 2 is given by (5.2) and 
does not depend on p. Deduce the expectation E[m 2 |ni, n 2 , N]. 



Since 

m~ 3§(N,p), m 2 \n 1 ~ 3§{n u p) 

and 

n 2 - m 2 |ni, m 2 ~ 38{N - ni,p) , 
the conditional distribution of m 2 is given by 

f(m 2 \n 1 ,n 2 ) oc ( Hl )p m Hl - p) n ^ m * ( N ~ Ul ) p n *~ m H\ - p )N-n 1 -n 2 +m 2 
\m 2 J \n 2 -m 2 J 

qj- f n l \ f N — Ui\ ^ TO2 +n 2 -m 2 /j _ pjni-m2+JV-ni-n 2 +m2 



\m 2 / \n 2 - m 2 

( fl\ \ ( N — Til 



\m 2 / \n 2 — to 2 

VrW 



Vm 2 / Vn 2 — m 2 



which is the hypergeometric J$?(N,n 2 ,ni/N) distribution. Obviously, this 
distribution does not depend on p and its expectation is 

E[m 2 \ni,n 2 ] = — . 



Exercise 5.4 In order to determine the number TV of buses in a town, a capture- 
recapture strategy goes as follows. We observe ri\ = 20 buses during the first day 
and keep track of their identifying numbers. Then we repeat the experiment the 
following day by recording the number of buses that have already been spotted 
on the previous day, say m 2 = 5, out of the n 2 = 30 buses observed the second 
day. For the Darroch model, give the posterior expectation of N under the prior 
ir(N) = l/N. 
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Using the derivations of the book, we have that 



7r(iV|ni, n 2 , m 2 ) oc 



1 



N 



N \n+ 



B(n c + l,2N-n c +l)l N > n + 



(N-l)l (2N~n c ){ 
a (N-n+y. (27V + 1)! N ^ n+ 

with n + = 45 and n c = 50. For n + — 45 and n c = 50, the posterior mean 
[obtained by an R code very similar to the one of Exercise 5.2 is equal to 
130.91. 



Exercise 5.5 Show that the maximum likelihood estimator of N for the Darroch 
model is N ~ n\j (mijn^) and deduce that it is not defined when m 2 = 0. 



The likelihood for the Darroch model is proportional to 
ffm (AT-m)! (N-n+y 



Since 



(N-n 2 )\ N\ 

£(N + 1) _ (N + 1 - m)(N + 1 - n 2 ) 
£(N) ~ (N+ 1 -n+)(N+ 1) 



> 1 



for 

(N + l) 2 - (N + l)(m + n 2 ) + ni n 2 > (JV + l) 2 - (N + l)n 4 

(iV + l)(ni + n 2 — n ) > nin 2 

n x n 2 



(N+ 1) < 



m 2 



the likelihood is increasing for N < n!n 2 /m2 and decreasing for N > 
nin 2 /m2. Thus N = nin 2 /m2 is the maximum likelihood estimator [assum- 
ing this quantity is an integer]. If m 2 = 0, the likelihood is increasing with N 
and therefore there is no maximum likelihood estimator. 



Exercise 5.6 Give the likelihood of the extension of Darroch's model when 
the capture-recapture experiments are repeated K times with capture sizes and 
recapture observations (1 < k < K) and rrik (2 < fc < K), respectively. (Hint: 
Exhibit first the two-dimensional sufficient statistic associated with this model.) 



When extending the two-stage capture-recapture model to a if-stage 
model, we observe K capture episodes, with rii ~ £s§(N,p) (1 < i < K), 
and K — 1 recaptures, 
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mi|ni,Ti2,TO2, • • • ,ni-i,mi-i,ni ~ (W, n;,ni + n 2 - m 2 H mj_i) . 

The likelihood is therefore 

K / nj\ K /ni-m 2 -\ to«_i\ /N—m-\ hm«_i\ 

n l.w -pj^-- n- — - — ,n\ m, ~ m< — - 

i=l ^ i=2 VriiJ 

(N\-n+)\ P [ V) 

where n+ = n\ — m 2 + ■ ■ ■ — m K is the number of captured individuals and 
where n c = ri\ + • • • + nx is the number of captures. These two statistics are 
thus sufficient for the if -stage capture-recapture model. 




When n + = 0, there is no capture at all during both capture episodes. 
The likelihood is thus (1 — p) 2N and, under the prior w(N,p) = 1/N, the 
conditional posterior distributions of p and N are 

p\N,n + = - 38e{l,2N+ 1) , 
(l-p) 2N 



N\p,n + = 



N 



That the joint distribution ir(N, p\n + = 0) exists is ensured by the fact that 
n(N\n+ = 0) cx 1/N(2N + 1), associated with a converging scries. 



Exercise 5.8 Show that, when the prior on TV is a 3^{X) distribution, the con- 
ditional posterior on N - n + is ,^(A(1 - p) 2 ^ 




The posterior distribution of (N,p) associated with the informative prior 
n(N,p) = X N e~ x /N\ is proportional to 



is -N>n+ 



\ N P n \l-p? 



(N-n+)\N\ 

The corresponding conditional on N is thus proportional to 



which corresponds to a Poisson ^(A(l — p) 2 ) distribution on N — n + . 



+ jjP n *(l-p) 2JV - nC ^N>n + 
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Exercise 5.9 An extension of the T-stage capture- recapture model is to 
consider that the capture of an individual modifies its probability of 
being captured from p to q for future captures. Give the likelihood 

£{N,p, q\m, n 2 , m 2 ...,n T , m T ) 




When extending the T-stage capture-recapture model with different prob- 
abilities of being captured and recaptured, after the first capture episode, 
where n\ <~ £$(N,p), we observe T — 1 new captures (i = 2, . . . ,T) 

ni - mi\ni,ri2,m2, ■ ■ ■ ,n l ^ 1 ,m i ^ 1 ~ £§(N - n\ - n 2 + ra 2 + . . . + m l ^ 1 ,p) , 

and T — 1 recaptures (i — 2, . . . ,T), 

TOi|ni,n 2 ,m 2 , . . . ,nj_i,mj_i — £S{n\ + n 2 - m 2 + . . . - m;_i,g) . 

The likelihood is therefore 



N ) p ni (i - P ) N - n > TT ( N ~ ni + " ' " " mi_1 V" 4 "" 14 (i - p) Ar -" i+ - +m * 

my f = 2 V n l -m l 



n 

TV! 



ni + n 2 - . . . - mj_i 



? mi (l-9) 



ni+.-.-m, 



oc 



(IV -n+)\ 



where n + = m — m 2 + ■ ■ ■ — rriT is the number of captured individuals, 



n 



* = Tn 1 + Y J {T-j + l){n j -m j ) 

3=2 



and where m + = mi + - • -+m T is the number of recaptures. The four statistics 
(ni,n + ,n* ,m + ) arc thus sufficient for this version of the T-stage capture- 
recapture model. 



Exercise 5.10 Another extension of the 2-stage capture-recapture model is to 
allow for mark losses. If we introduce q as the probability of losing the mark, r as 
the probability of recovering a lost mark and k as the number of recovered lost 
marks, give the associated likelihood £(N,p, q, r\ni, n 2 , m 2 , k). 



There is an extra-difficulty in this extension in that it contains a latent 
variable: let us denote by z the number of tagged individuals that have lost 
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their mark. Then z ~ 33(m,q) is not observed, while k ~ &(z, r) is observed. 
Were we to observe (n 1; n 2 , m 2 , k, z), the [completed] likelihood would be 

e(N,p,q,r\n u n 2 ,m 2 ,k,z)= Q p»i(l "P)" - " 1 </*(!- «) ni_z 

^ r k (l - r) z - k ( Ul ~ *) p m2 (l - p )«i-*-"* 
k) \ m 2 J 

N — Tlx + Z\ p n2 - m2 _ x N- ni +z-n 2 +m 2 

since, for the second round, the population gets partitioned into individuals 
that keep their tag and are/are not recaptured, those that loose their tag 
and are/are not recaptured, and those that are captured for the first time. 
Obviously, it is not possible to distinguish between the two last categories. 
Since z is not known, the [observed] likelihood is obtained by summation over 
z: 

£(N,p,q,r\ ni ,n 2 ,m 2 ,k) cx — ^ ! p n ^{\ - p )^-m-n a 



E 

Z=k\/N— 711— 712+7712 



ni\ fni- 
z J I m 2 



N ni+Z ^q z (l-q) n i- z r k (l-r) z - k . 



n 2 - m 2 

Note that, while a proportionality sign is acceptable for the computation of the 
likelihood, the terms depending on z must be kept within the sum to obtain 
the correct expression for the distribution of the observations. A simplified 
version is thus 

t(N,p, q, r\n u n 2 ,m 2 ,k) cx ^^yy p" 1+ " 2 (1 - p) 2N - n ^ <T (r/(l - r)) k 

n y 2 (N-n 1 + z)l[q(l-r)/(l-q)r 

z =kvN^- n2+ m 2 Z ^ Z m ^ N -ni-n 2 + m 2 + z)\' 

but there is no close-form solution for the summation over z. 



Exercise 5.11 Reproduce the analysis of eurodip when switching the prior from 




The main purpose of this exercise is to modify the code of the function 
gibbsl in the file #5.R on the webpage, since the marginal posterior distri- 
bution of N is given in the book as 
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,*r, + « (N-l)\ T7V- n c l 

Tr(N\n + ,n c ) oc -+ -, ^- I;v> n +vi • 

1 ' 1 (N-n+)l (TN + 1)1 N ^ n+V1 

(The conditional posterior distribution of p does not change.) This distribution 
being non-standard, it makes direct simulation awkward and we prefer to use 
a Metropolis-Hastings step, using a modified version of the previous Poisson 
conditional as proposal q(N'\N,p). We thus simulate 

N* - n+ ~ & -J9 ( *- X) ) T ) 

and accept this value with probability 

Tr(N*\n+ 1 n c ) q(N ( - t - 1 '>\N* 1 p ( - t - 1 ')) 
ir(N^- 1 )\n+,n c ) qjN*\N^ ,P {t ^) A ' 

The corresponding modified R function is 

gibbs 1 l=f unction (nsimu , T , nplus , nc) 
{ 

# conditional posterior 
rati=f unction(N) { 

If actorial(N-l)+lf actorial (T*N-nc) - 
If actorial (N-nplus) -If actorial (T*N+1) 

} 



N=rep(0, nsimu) 
p=rep(0, nsimu) 



N[l]=2*nplus 

p [1] =rbeta(l ,nc+l ,T*N [1] -nc+1) 
for (i in 2: nsimu) { 



# MH step on N 
N[i]=N[i-l] 

prop=nplus+rpois(l,N[i-l] *(l-p[i-l] )~T) 

if (logCrunif (l))<rati (prop) -rati (N[i])+ 

dpois (N [i-1] -nplus , prop* (1-p [i-1] ) "T , log=T) - 
dpois (prop-nplus , N [i-1] * (1-p [i-1] ) "T, log=T) ) 
N [i] =prop 

p [i] =rbeta(l ,nc+l ,T*N [i] -nc+1) 

} 

list(N=N,p=p) 
} 

The output of this program is given in Figure |5.1| 
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O 1 OO 200 300 400 500 




O 1 OO 200 300 400 SOO 





320 360 400 440 



0.16 0.1S 0.20 0.22 0.24 



Fig. 5.1. eurodip: MCMC simulation under the prior n(N,p) oc N . 



Exercise 5.12 Show that the conditional distribution of n is indeed propor- 
tional to the product (5.4). 



The joint distribution of T>* — (ri%, c 2 , c 3 , r 1; r 2 ) is given in the book as 



l p) N " ^ ^ ' j tf i I ' ("' f . ri ')i''-'( l /'»" 



Therefore, if we only keep the terms depending on ri, we indeed recover 

g ri (l-g)" 1_ri 7-^ (1 - P ) ni - ri - C2 



ri!(?ii - n)! 

(«i - ri)! 



(ni - r x - c 2 )l 



- (1 - q)»i-n-r 2 ^ 7 ll ! _^ Q _ pjm-n-ra-c 

(ni-ri-r 2 j! ' (ni - n - r 2 - c 3 j! 

("-1 - n)\ f g 

n!(ni - n - c 2 )!(ni - n - r 2 - c 3 )! \ (1 - g) 2 (l - p) 2 

ri 



m — c 2 \ ni—ri 



q 



r 2 + c 3 J l(l-g) 2 (l-p) 2 



under the constraint that n < min(ni, rii — r 2 , rii — r 2 — C3, ni— c 2 ) = min(ni • 
r 2 - c 3 ,ni - c 2 ). 
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Exercise 5.13 Show that r 2 can be integrated out in the above joint distribution 
and leads to the following distribution on r\. 

, i x (ni - ri)!(m - n - c 3 )! 

7r(ri \p,q,ni,c 2 ,c 3 ) oc r"; (5.1) 

n!(ni - ri - c 2 )! 

9 



(l-p)(l- g )[g+(l-p)(l-g)] 



Compare the computational cost of a Gibbs sampler based on this approach with 
a Gibbs sampler using the full conditionals. 



Following the decomposition of the likelihood in the previous exercise, the 
terms depending on r 2 are 

1 / q V 2 (ni-ri-r 2 )! 



r 2 !(ni - ri - r 2 )\ \{l - p)(l - q) J (rii - ri - r 2 - c 3 )! 

i f q 



r-2 



r 2 !(ni - ri - r 2 - c 3 )! \(1 -p)(l - g) 
If we sum over < r 2 < rii - ri -C3, we get 

1 / ni - ri - c 3 \ / 17 



E 



(ni-n-c 3 )! ^ V fc / \( l -p)( l - q) 



i < q 



}m—ri—c 3 

that we can agregate with the remaining terms in r\ 
(n-ri)! f <? 



ri!(ni - ri - c 2 )! [ (1 - q) 2 (l - p) 



to recover (5.1 1. 

Given that a Gibbs sampler using the full conditionals is simulating from 
standard distributions while a Gibbs sampler based on this approach requires 
the simulation of this non-standard distribution on n, it appears that one 
iteration of the latter is more time-consuming than for the former. 



Exercise 5.14 Show that the likelihood associated with an open population can 
be written as 

(e«At) it t=li=l 

X pCl-e <t M« (1 _ p) (l-e it )(l-5 it ) ) 
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where q = q, q\ = 1, and 5 it and e it are the capture and exit indicators, 
respectively. Derive the order of complexity of this likelihood, that is, the number 
of elementary operations necessary to compute it. 



This is an alternative representation of the model where each individual 
capture and life history is considered explicitely. This is also the approach 
adopted for the Arnason-Schwarz model of Section 5.5. We can thus define 
the history of individual 1 < i < N as a pair of sequences {en) and (Sa), 
where en = 1 at the exit time t and forever after. For the model given at 
the beginning of Section 5.3, there are n\ £ji's equal to 1, r\ en's equal to 1, 
C2 fe's equal to 1 among the z's for which 8n — 1 and so on. If we do not 
account for these constraints, the likelihood is of order 0(3 NT ) [there are three 
possible cases for the pair (e it , <5 it ) since Su = if e it = 1]. Accounting for the 
constraints on the total number of Sn's equal to 1 increases the complexity of 
the computation. 



Exercise 5.15 Show that, for M > 0, if g is replaced with Mg in 5^ and if 
(X, U) is uniformly distributed on , the marginal distribution of X is still g. 
Deduce that the density g only needs to be known up to a normalizing constant. 



The set 

& = {(x,u) : < u < Mg(x)} 

has a surface equal to M. Therefore, the uniform distribution on 5? has density 
1/M and the marginal of X is given by 



/ 



1 , Mg(x) , , 

l(0,Mp(x))^d«= — 



This implies that uniform simulation in 5? provides an output from g no mat- 
ter what the constant M is. In other words, g does not need to be normalised. 



Exercise 5.16 For the function g(x) = (l+sin 2 (a;))(2+cos 4 (4x)) exp[— x 4 {l + 
sin 6 (x)}] on [0,27r], examine the feasibility of running a uniform sampler on the 
associated set y. 



The function g is non-standard but it is bounded [from above] by the 
function g(x) = 6exp[— x A ] since both cos and sin are bounded by 1 or even 
g(x) — 6. Simulating uniformly over the set associated with g can thus be 
achieved by simulating uniformly over the set .5? associated with g until the 
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output falls within the set 5? associated with g. This is the basis of accept- 
reject algorithms. 

Exercise 5.17 Show that the probability of acceptance in Step 2 of Algorithm 
5.2 is 1/M, and that the number of trials until a variable is accepted has a 
geometric distribution with parameter 1/M. Conclude that the expected number 
of trials per simulation is M. 

The probability that U < g{X) /(Mf(X)) is the probability that a uniform 
draw in the set 

y = {(x,u) : < u < Mg(x)} 

falls into the subset 

<7>o={(x,u) :0<u<f(x)}. 

The surfaces of and being M and 1, respectively, the probability to fall 
into y is 1/M. 

Since steps 1. and 2. of Algorithm 5.2 are repeated independently, each 
round has a probability 1 /M of success and the rounds are repeated till the 
first success. The number of rounds is therefore a geometric random variable 
with parameter 1/M and expectation M. 



Exercise 5.18 For the conditional distribution of at derived from (5.3), con- 
struct an Accept-Reject algorithm based on a normal bounding density / and 
study its performances for N = 53, n t = 38, fi t — —0.5, and a 2 — 3. 



That the target is only known up to a constant is not a problem, as demon- 
strated in Exercise 5.15 To find a bound on ir(at\N,n t ) [up to a constant], 
we just have to notice that 



(l + e at ) 



-N 



< e 



-Na t 



and therefore 
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(1 + e a *)~ N exp |a t nt - ^(a t - ^ t ) 2 

< exp |a t (n t - iV) - ^ (a t - ^) 2 

=«"{-^ +2 0('--<' 2 ( JV -"»-^} 

= -^^expl-^(a t - ^ t + (j 2 (N -n t )) 2 



x \/2^7Cxp |-^2 (M? - \lH - <? 2 (N - n t )] 2 )| . 



The upper bound thus involves a normal ,jV{\it — cr 2 (N — n t ) , a 2 ) distribution 
and the corresponding constant. The R code associated with this decomposi- 
tion is 

# constants 
N=53 
nt=38 
mut=- . 5 
sig2=3 

sig=sqrt(sig2) 



# log target 
ta=function(x){ 

-N*log(l+exp(x))+x*nt-(x-mut) ~2/(2*sig2) 

} 



#bounding constant 
bmean=mut-sig2* (N-nt) 

uc=0 . 5*log(2*pi*sig2) + (bmean~2-mut ~2) / (2*sig2) 
prop=rnorm ( 1 , sd=s ig) +bmean 

ratio=ta(prop) -uc-dnorm (prop , mean=bmean , sd=sig , log=T) 



while (log(runif (l))>ratio){ 



prop=rnorm(l , sd=sig)+bmean 

rat io=ta (prop) -uc-dnorm (prop , mean=bmean , sd=s ig , log=T) 
} 

The performances of this algorithm degenerate very rapidly when TV — n t is 
[even moderately] large. 
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Exercise 5.19 When uniform simulation on 5? is impossible, construct a Gibbs 
sampler based on the conditional distributions of u and x. (Hint. Show that both 
conditionals are uniform distributions.) This special case of the Gibbs sampler is 
called the slice sampler (see Robert and Casella, 2004, Chapter 8). Apply to the 
distribution of Exercise [5. 161 



Since the joint distribution of (X, U) has the constant density 

t(x,u) = h<u<g(x) , 

the conditional distribution of U given X — x is ^(0, g(x)) and the condi- 
tional distribution of X given U = u is ^({x; g(x) > u}), which is uniform 
over the set of highest values of g. Both conditionals are therefore uniform 
and this special Gibbs sampler is called the slice sampler. In some settings, 
inverting the condition g(x) > u may prove formidable! 



If we take the case of Exercise 5.16 and of g(x) = cxp(— a; 4 ), the set 
{x;g(x) > u} is equal to 

{x; g(x) > u} = {x; x < (- log(x)) 1 ^ , 

which thus produces a closed-form solution. 

Exercise 5.20 Reproduce the above analysis for the marginal distribution of n 



computed in Exercise 5.13 



The only change in the codes provided in #5 . R deals with seuil, called by 
ardipper, and with gibbs2 where the simulation of r 2 is no longer required. 



Exercise 5.21 Show that, given a mean and a 95% confidence interval in [0, 1], 
there exists at most one beta distribution 3§(a, b) with such a mean and confidence 
interval. 



If < m < 1 is the mean m = a/(a + b) of a beta 3§e(a, b) distribution, 
then this distribution is necessarily a beta fe(am,a(l — m)) distribution, 
with a > 0. For a given confidence interval [£,u], with < £ < m < u < I, we 
have that 

[since, when a goes to zero, the mass of the beta ^e(am, a(l—m)) distribution 
gets more and more concentrated around and 1, with masses (1 — m) and 
m, respectively] and 
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lim f r ^ x am ~ 1 (l-x) a( - 1 ~ m ^~ 1 dx = l 

a— >oo J t r(am)r(a(l — m)) 

[this is easily established using the gamma representation introduced in Ex- 



4.19 and the law of large numbers]. Therefore, due to the continuity [in 



a] of the coverage probability, there must exist one value of a such that 

B(£, u\a, m) = f — , x am -\l - x )^^ dx = 0.9 . 

J i 1 (am)l {a(l — m) 



Figure 5.2 illustrates this property by plotting B(£, u\a, m) for I — 0.1, u 
0.6, m — 0.4 and a varying from 0.1 to 50. 




Fig. 5.2. Coverage of the interval (I, u) — (0.1, 0.6) by a ,^e(0.4a, 0.6a) distribution 
when a varies. 



Exercise 5.22 Show that groups of consecutive unknown locations are indepen- 
dent of one another, conditional on the observations. Devise a way to simulate 
these groups by blocks rather than one at a time, that is, using the joint posterior 
distributions of the groups rather than the full conditional distributions of the 
states. 



As will become clearer in Chapter 7, the Arnason-Schwarz model is a very 
special case of [partly] hidden Markov chain: the locations ^(i.t) of an indi- 
vidual i along time constitute a Markov chain that is only observed at times 
t when the individual is captured. Whether or not Z(i,t) is observed has no 
relevance on the fact that, given z^^, (zu t t—i)> 2(i,t— 2)> • • •) is independent 
from (z^t+i), Z(i,t+2)) ■ ■ •)■ Therefore, conditioning on any time t and on the 
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corresponding value of z[^ t ) makes the past and the future locations indepen- 
dent. In particular, conditioning on the observed locations makes the blocks 
of unobserved locations in-between independent. 

Those blocks could therefore be generated independently and parallely, 
an alternative which would then speed up the Gibbs sampler compared with 
the implementation in Algorithm 5.3. In addition, this would bring additional 
freedom in the choice of the proposals for the simulation of the different blocks 
and thus could further increase efficiency. 



6 



Mixture Models 



Exercise 6.1 Show that a mixture of Bernoulli distributions is again a Bernoulli 
distribution. Extend this to the case of multinomial distributions. 



By definition, if 

k 

x ~ ^p l S§{q i ) , 
i=l 

then x only takes the values and 1 with probabilities 

- qi) = 1 - ^2,Piq% and ^p l q l , 

i—1 i—1 i—1 

respectively. This mixture is thus a Bernoulli distribution 

When considering a mixture of multinomial distributions, 

k 

1=1 

with <\i = (qn, . . . ,qik), x takes the values 1 < j < k with probabilities 

k 

i=l 

and therefore this defines a multinomial distribution. This means that a mix- 
ture of multinomial distributions cannot be identifiable unless some restric- 
tions are set upon its parameters. 
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Exercise 6.2 Show that the number of nonnegative integer solutions of the 
decomposition of n into k parts such that m + . . . + = n is equal to 

n + k — 
n 

Deduce that the number of partition sets is of order 0(n fe_1 ). 



This is a usual combinatoric result, detailed for instance in Feller (1970). 
A way to show that r is the solution is to use the "bottomless box" trick: 
consider a box with k cases and n identical balls to put into those cases. If we 
remove the bottom of the box, one allocation of the n balls is represented by 
a sequence of balls (O) and of case separations (|) or, equivalently, of 0's and 
l's, of which there are n and k — 1 respectively [since the box itself does not 
count, we have to remove the extreme separations]. Picking n positions out of 
n + (k — 1) is exactly r. 

This value is thus the number of "partitions" of an n sample into k groups 
[we write "partitions" and not partitions because, strictly speaking, all sets 
of a partition are non-empty]. Since 



n + k-1 



(n + k-l)\ 



,fc-i 



n\{k-l)\ 0-1)!' 
when n ^> k, there is indeed an order 0(n *) of partitions. 



Exercise 6.3 For a mixture of two normal distributions with all parameters un- 
known, 

P Af{fH,al) + (1 - p)M{m, <4) , 
and for the prior distribution (j = 1,2) 

HiWj ~ Sitj'OiM > ^ ~ S&(vj/2, s]/2) , p ~ #e(a, (3) , 
show that 

p|x,z ~ &e(a + £i,/3 + e 2 ), 

(JtjWi,*,! ~ JT (ti(z), ^rf^j > ajM ~ SV(( Vj +£j)/2, Sj (z)/2) 

where £j is the number of Zi equal to j, Xj(z) and Sj(z) are the empirical mean 
and variance for the subsample with Zi equal to j, and 



Compute the corresponding weight u(z) 
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If the latent (or missing) variable z is introduced, the joint distribution of 
(x, z) [equal to the completed likelihood] decomposes into 



n^/o*)=n n 

1=1 j = H; Zi =j 



n 



-(^-^■) 2 /2<t 3 2 



(6.1) 



where pi — p and p 2 = (1 — p). Therefore, using the conjugate priors proposed 
in the question, we have a decomposition of the posterior distribution of the 
parameters given (x,z) in 

2 



P 



i+a-l 



(1-P) 



-(xi-Mi) 2 /2<r? 



-7r(/ij,CTj) • 



This implies that p|x,z <~ J?e(a + £i,/3 + -£ 2 ) and that the posterior distri- 
butions of the pairs (/Xj, cr?) are the posterior distributions associated with 
the normal observations allocated (via the Zj's) to the corresponding compo- 
nent. The values of the hyperparameters are therefore those already found in 
Chapter 2 (see, e.g., eqn. (4.6) and Exercise 2.13). 

The weight w(z) is the marginal [posterior] distribution of z, since 

7r(0,p|x) = ^w(z)tt(0,p|x,z) . 



Therefore, if p\ = p and P2 = 1 — p, 

2 



~^\f\\i ; , n 



e 



j=l i;z i= j 

r(a + £ 1 )r((3 + £ 2 ) 



(Ji 



-w(0,p) &0dp 



r(a + f3 + n) 
2 

n cxp 



^2 {( n J + - 0(z)) 2 + «j 



(z)} 



d<9 



r(a + 4)r(/3 + ^ 2 ) /-((£,- + ^-)/2)( Sj -(z)/2)^+^)/ 2 



r(a + /3 + n) 



\/ n i + ^ 



and the proportionality factor can be derived by summing up the rhs over all 
z's. (There are 2 n terms in this sum.) 



Exercise 6.4 For the normal mixture model of Exercise 6.3, compute the func- 
tion Q(8o,9) and derive both steps of the EM algorithm. Apply this algorithm to 
a simulated dataset and test the influence of the starting point 9q. 
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Starting from the representation (6.1l above, 



logt(O,p\x,z) = J2{ I i(zi)teg(pf(x i \0 1 )+h{z i )log((l-p)f(x i \6 2 )} 

i=l 

which implies that 

Q{(6^, p^), (6, p)} = E (e(t);pM) [log £(6, P \k,z)\x] 

71 

= E { p (e< t ),p( t >) ( z i = l|x)log(p/(xi|0i) 

i=l 

+P(9W,p(«)) 0* = 2 I X ) lo s((i - p) f(xi\e 2 )} 

n 

= log(pM) X] P (0C*>,pM) 0* = !| x ) 



•log((l-p)/CT 2 )X] p (ew,pW) 0* = 2 I X ) 

i=l 

X/ P (ew,p(')) = 1 I X ) 



2a? 



2^ p (ew,pW) 0* = 2 ' x ) 



2af 



If we maximise this function in p, we get that 



,(t+i) 



1 - 

= " E P (e (f) .p (t) ) (** = 1 ' x ) 



n 

A E 



n i^pMf(x i \6®) + (l- I m)f(x i \0W) 
while maximising in (fij,~j) (J — 1,2) leads to 



(*+i) 

f-s 



E P (e (t) >P (1) ) 0* = il x ) ^ / E P (e (t) ,p (t) ) Oi =il x ) 



' i=l 

xivffix^ef) 



a 



2(*+l) 



1 " 

«P? +1) s^/^iei'^ + a-pW)/^!^) ' 

E P (0 (t) >p (t) ) 0; = il x ) Oi / E P (0 w ,p (t) ) ( Zl = - ? 'l x ) 



(*+i) 



E 



(t+i) 



pff{xi\ef) 



ip®f{x i \9?) + {l--®)f{x i \0™) 
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where pY = p^ and p^ — (1 — p^). 

A possible implementation of this algorithm in R is given below: 

# simulation of the dataset 
n=324 

tz=sample(l :2,n,prob=c( .4, .6) ,rep=T) 
tt=c(0,3.5) 
ts=sqrt(c (1.1,0.8)) 
x=rnorm(n,mean=tt [tz] , sd=ts [tz] ) 

para=matr ix (0 , ncol=50 , nrow=5) 
likem=rep (0 , 50) 

# initial values chosen at random 

para[, 1] =c (runif (1) ,mean(x)+2*rnorm(2) *sd(x) ,rexp(2) *var (x) ) 
likem [1] =sum(log( para[l , 1] *dnorm(x,mean=para[2, 1] , 

sd=sqrt (para [4, 1] ) )+(l-para[l , 1] ) *dnorm(x,mean=para[3 , 1] , 

sd=sqrt (para [5, 1] ) ) )) 

# 50 EM steps 
for (em in 2:50){ 

# E step 

postprob=l/ ( l+(l-para[l,em-l] )*dnorm(x,mean=para[3,em-l] , 
sd=sqrt (para [5, em- 1] ) ) / ( para [1 ,em-l] *dnorm(x, 
mean=para[2,em-l] ,sd=sqrt(para[4,em-l] ))) ) 

# M step 

para [1 , em] =mean(postprob) 

para [2, em] =mean(x*postprob) /para [1 , em] 

para [3 , em] =mean (x* ( 1-postprob) ) / ( 1-para [1 , em] ) 

para [4, em] =mean( (x-para[2, em] ) ~2*postprob) /para[l , em] 

para[5,em] =mean((x-para[3,em] ) ~2* (1-postprob) ) / (l-para[l,em] ) 

# value of the likelihood 

likem [em] =sum( log (para [1 , em] *dnorm(x,mean=para [2 , em] , 

sd=sqrt(para[4,em] ))+(l-para[l,em] )*dnorm(x,mean=para[3,em] , 
sd=sqrt(para[5,em] )) )) 

} 

Figure [Q] represents the increase in the log- likelihoods along EM iterations 
for 20 different starting points [and the same dataset x]. While most starting 
points lead to the same value of the log-likelihood after 50 iterations, one 
starting point induces a different convergence behaviour. 
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10 20 30 40 

iterations 



Fig. 6.1. Increase of the log-likelihood along EM iterations for 20 different starting 
points. 



Exercise 6.5 Show that the dj's in model (6.2) are dependent on each other 
given (only) x. 



The likelihood associated with model (6.2) being 



*(Mx)=n 



i=l 



J^Pj f( x i\ e j) 



it is clear that the posterior distribution will not factorise as a product of 
functions of the different parameters. It is only given (x,z) that the 6j's are 
independent. 



Exercise 6.6 Construct and test the Gibbs sampler associated with the (£, fj,o) 
parameterization of (6.3) when ii x = /i — £ and ^2 = Mo + ?■ 



The simulation of the z^s is unchanged [since it does not depend on the 
parameterisation of the components. The conditional distribution of (£,Mo) 
given (x,z) is 



tt(£>Mo|x,z) oc exp 



y\ ( Xi ~ ^ + Z) 2 + E (*< -1*°- 2 \ 

lzj=l z i= 2 ) 
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Therefore, £ and /io are not independent given (x,z), with 

, ;nx + (t 1 -t 2 )i i' 

/Xo|£,x,z ~ <yf ' 



n ' n 



The implementation of this Gibbs sampler is therefore a simple modifica- 
tion of the code given in #6 . R on the webpage: the MCMC loop is now 

for (t in 2:Nsim){ 



# allocation 

f act= . 3*sqrt (exp (gul~2-gu2~2) ) / . 7 
probs=l/ ( 1+f act*exp(sampl* (gu2-gul) ) ) 
zeds=(runif (N) <probs) 



# Gibbs sampling 

muO=rnorm(l)/sqrt (N)+(sum(sampl)+xi* (sum(zeds==l) 

-sum(zeds==0) ) )/N 
xi=rnorm(l) / sqrt (N)+(sum(sampl [zeds==0] -muO) 

-sum(sampl [zeds==l] -muO) ) /N 



# reparameterisation 

gul=muO-xi 

gu2=mu0+xi 

muz [t , ] = (c (gul , gu2) ) 



} 

If we run repeatedly this algorithm, the Markov chain produced is highly 
dependent on the starting value and remains captive of local modes, as illus- 
trated on Figure |6.2| This reparameterisation thus seems less robust than the 
original parameterisation. 



Exercise 6.7 Give the ratio corresponding to (6.7) when the parameter of inter- 
est is in [0, 1] and the random walk proposal is on the logit transform log#/(l— 6). 



Since 

9 log [0/(1-0)} 



30 OL ' v " 6 1-6) 9(1-6) 

the Metropolis-Hastings acceptance ratio for the logit transformed random 
walk is 




Fig. 6.2. Influence of the starting value on the convergence of the Gibbs sampler 
associated with the location parameterisation of the mean mixture (10,000 itera- 
tions). 
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MM gj(i-gj) A1 

Exercise 6.8 Show that, if an exchangeable prior it is used on the vector of 
weights (pi, . . . ,pk) , then, necessarily, E*"^] = 1/fc and, if the prior on the 
other parameters (6\, . . . , 9k) is also exchangeable, then E^jlxi, . . . , x n ] =l/k 
for all j's. 




If 

7r(pi, . ..,pk) = 7r(p CT( i), . . . ,p CT ( fe) ) 
for any permutation cr € 6^, then 




,...,p fc )dP = E ,r [pi]. 



Given that = 1> this implies E 17 ^-] = 1/fc. 

When both the likelihood and the prior are exchangeable in (pj ,6j), the 
same result applies to the posterior distribution. 

Exercise 6.9 Show that running an MCMC algorithm with target 7r(6*|x) 7 will 
increase the proximity to the MAP estimate when 7 > 1 is large. Discuss the 
modifications required in Algorithm 6. 2. to achieve simulation from 7r(6*|x) 7 when 

7€N*. 




The power distribution 7r 7 (0) oc 7r(0) 7 shares the same modes as n, but the 
global mode gets more and more mass as 7 increases. If 9* is the global mode of 
7T [and of 7r 7 ], then {n(9) /tt(9*)} 7 goes to as 7 goes to 00 for all 0's different 
from 9*. Moreover, for any < a < 1, if wc define the a neighbourhood ^q, 
of 9* as the set of 0's such that ir(6) > aw(6*), then 7r 7 (9t a ) converges to 1 
as 7 goes to 00. 

The idea behind simulated annealing is that, first, the distribution 7r 7 (#) oc 
7r(0) 7 is more concentrated around its main mode than ir(9) if 7 is large and, 
second, that it is not necessary to simulate a whole sample from tt(9), then a 
whole sample from tt(9) 2 and so on to achieve a convergent approximation of 
the MAP estimate. Increasing 7 slowly enough along iterations leads to the 
same result with a much smaller computing requirement. 

When considering the application of this idea to a mean mixture as (6.3) 
[in the book], the modification of Algorithm 6.2 is rather immediate: since 
we need to simulate from 7r(#,p|x) 7 [up to a normalising constant], this is 
equivalent to simulate from 1(9, p\x) 7 x 7r(0,p) 7 . This means that, since the 
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prior is [normal] conjugate, the prior hyperparameter A is modified into 7A 
and that the likelihood is to be completed 7 times rather than once, i.e. 

e(8, P \xF= (/ /(x,z|0,p)d z y = J[Jf( x ,z j \e,p)dz j . 

Using this duplication trick, the annealed version of Algorithm 6.2 writes as 



Algorithm Annealed Mean Mixture Gibbs Sampler 
Initialization. Choose /4°' and fx^\ 
Iteration t (t > 1). 

1. For i = 1, . . . , n, j = 1, . . . ,7, generate zf) from 

F(zij = 1) oc p cxp j-i (xi - j 
P(z« = 2) cx (1-p) exp|-i (ari-^* - ^) 2 } 

2. Compute 

7 n 7 n 

J = 1 Z— 1 J — 1 l — l 

or 4. ( f ) f v( f XS + barxi (z) 1 \ 

3. Generate n\ from JY : , — 

n V l x + i 7A + ^/ 

a r 4. (*) f ^ f 7^ + 52 (z) 1 

4. Generate ' from JY — — 

^ \ 7A + 7n - £ 7A + 7n - £ 



This additional level of completion means that the Markov chain will have 
difficulties to move around, compared with the original Gibbs sampling algo- 
rithm. While closer visits to the global mode are guaranteed in theory, they 
may require many more simulations in practice. 



Exercise 6.10 In the setting of the mean mixture (6.3), run an MCMC simula- 
tion experiment to compare the influence of a jY (0, 100) and of a JY (0, 10000) 
prior on (/ii,^) on a sample of 500 observations. 




This is straightforward in that the code in #6 . R simply needs to be modi- 
fied from 
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# Gibbs samplin 

gul=rnorm(l)/sqrt ( . l+length(zeds [zeds==l] ) ) + 
( sum ( samp 1 [zeds==l] ) )/ ( . l+length(zeds [zeds==l] ) ) 

gu2=rnorm(l)/sqrt ( . l+length(zeds [zeds==0] ) ) + 
(svrm(sampl [zeds==0] ) )/( . l+length(zeds [zeds==0] ) ) 

to 

# Gibbs samplin 

gul=rnorm(l)/sqrt ( . 01+length(zeds [zeds==l] ) ) + 

(sum(sampl [zeds==l] ) )/( . 01+length(zeds [zeds==l] ) ) 

gu2=rnorm(l)/sqrt ( . 01+length(zeds [zeds==0] ) ) + 

(sum(sampl [zeds==0] ))/( . 01+length(zeds [zeds==0] ) ) 

for the o/f(0, 100) prior and to 

# Gibbs samplin 

gul=rnorm ( 1 ) /sqrt ( . 0001+length (zeds [zeds==l] ) ) + 

(sum(sampl [zeds==l] ) ) / ( . 0001+length (zeds [zeds==l] ) ) 

gu2=rnorm(l)/sqrt( . 0001+length(zeds [zeds==0] ))+ 

(sum(sampl [zeds==0] ) ) / ( . 0001+length (zeds [zeds==0] ) ) 

for the ^(0, 10 4 ) prior. While we do not reproduce the results here, it appears 
that the sampler associated with the ^(0, 10 4 ) prior has a higher probability 
to escape the dubious mode. 



Exercise 6.11 Show that, for a normal mixture 0.5 (0,1) + 0.5 ^V(fi, cr 2 ), 
the likelihood is unbounded. Exhibit this feature by plotting the likelihood of a 
simulated sample, using the R image procedure. 



This follows from the decomposition of the likelihood 



£(0| X ) = n 



J2o.5.f(x t \0,) 



3=1 



into a sum [over all partitions] of the terms 

n /(»<!*,)= n ^ n ^7 )M 

i—l i\Zi — l i;Zi—2 

In exactly n of those 2" partitions, a single observation is allocated to the 
second component, i.e. there is a single i such that Zi = 2. For those particular 
partitions, if we choose fj, = Xi, the second product reduces to 1/cr which is 
not bounded when a goes to 0. Since the observed likelihood is the sumc of 
all those terms, it is bounded from below by terms that are unbounded and 
therefore it is unbounded. 

An R code illustrating this behaviour is 
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# Sample construction 
N=100 

sampl=rnorm (N) + (runif (N) < . 3) *2 . 7 

# Grid 

mu=seq(-2 .5,5.5, length=250) 

sig=rev(l/seq( . 001 , . 01 , length=250) ) # inverse variance 

mol=muy*y o t (rep(l , length=length(sig) ) ) 

mo2=(rep(l , length=length(mu) ) )°/,*°/,t (sig) 

cal=-0 . 5*mol~2*mo2 

ca2=mol*mo2 

ca3=sqrt (mo2) 

ca4=0.5*(l-mo2) 

# Likelihood surface 
like=0*mol 

for (i in 1:N) 

like=like+log(l+exp(cal+sampl [i] *ca2+sampl [i] ~2*ca4) *ca3) 
like=like-min(like) 

sig=rev(l/sig) 

image (mu, sig, like ,xlab=expression(mu) , 

ylab=expression(sigma~2) , col=heat . colors (250) ) 
contour (mu, sig, like , add=T,nlevels=50) 



and Figure 6.3 exhibits the characteristic stripes of an explosive likelihood as 



a approaches for values of fi close to the values of the sample. 



Exercise 6.12 Show that the ratio (6.8) goes to 1 when a goes to when the 
proposal q is a random walk. Describe the average behavior of this ratio in the 
case of an independent proposal. 



This is obvious since, when the proposal is a random walk [without 
reparameterisation] , the ratio q(9, p\9' ,p') /q(6' ,p'|0,p) is equal to 1. The 
Metropolis-Hastings ratio thus reduces to the ratio of the targets to the power 
a, which [a.e.] converges to 1 as a goes to 0. 

In the case of an independent proposal, 

7T(0',p'|x)V q(8, p) 



7r(0,p|x) J q(d',p>) 

is equivalent to q(9,p)/q(9' ,p') and therefore does not converge to 1. This 
situation can however be avoided by picking q a rather than q, in which case 
the ratio once more converges to 1 as a goes to 0. 
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Exercise 6.13 If one needs to use importance sampling weights, show that 
the simultaneous choice of several powers a requires the computation of the 
normalizing constant of 7r Q . 



If samples (9i a )i from several tempered versions ir a of tt are to be 
used simultaneously, the importance weights associated with those samples 
7r(^ a ) /ir a (0i a ) require the computation of the normalizing constants, which 
is most often impossible. This difficulty explains the appeal of the "pumping 
mechanism" of Algorithm 6.5, which cancels the need for normalizing con- 
stants by using the same 7T a twice, once in the numerator and once in the 
denominator (see Exercice 6.141. 



Exercise 6.14 Check that Algorithm 6.5 does not require the normalizing con- 
stants of the Tr ai 's and show that tt is the corresponding stationary distribution. 



Since the acceptance probability 

7T ai (*<*>) *a f (x®i) ^ P - 1 (4 t) ) ^S-l) 



min < 1 , 



88 6 Mixture Models 

uses twice each power aj of n, the unknown normalizing constants of the ir ai 's 
vanish, which is one of the main reasons for using this algorithm. 

The fact that n is stationary can be derived from the "detailed balance" 
condition: let us assume that each of the MCMC kernels satisfy the corre- 
sponding detailed balance equation 

7r a .(xo)MCMC(xi|x ,7r Q .) = 7r a .(xi)MCMC(x |xi,7r a .) . 

Then 

7r(zW)MCMC(a£ ) |a; (t) ,'ra 1 ) ■ ■ -MCMC^glxg-i,^) 



x mm 



I ' < x(t) ) '"^.,(4-1) ^(4°) '""^(42-1) 

7 r Ql (x( t ))MCMC(xf ) |xW, 7r Ql ) • • • MCMC(xg|x£-i> 

' 7t(x«) 7T Q2 (xW) ^(xgU) \ 



x mm 



= MCMC(x w \x { * } , tt Qi )tt Qi (x^ )MCMC(4* ) , 7r aa ) ■ ■ ■ 
xMCMC^|xg_ 1 , 7rQl )min/^^ 7 ^ - - - 



MCMC(x (t) 1 4* } , Jr ai jMCMCfif |x^ } , tt Q2 )tt Q2 (x^ ) • • • 



xMCMC(x^|x^_ 1 ,7r Ql )min 



7T ai (x(*))7T a2 (xf ) ) n a2 (x^) 



= MCMC(xW|x^ , n ai ) • • • MCMC(xg_ 1 |xg , n ai )7r(xg) 



x mm ■ 



7T(XW) ^(xg) 

^W" .(xg) ' , 



by a "domino effect" resulting from the individual detailed balance condi- 
tions. This generalised detailed balance condition then ensures that n is the 
stationary distribution of the chain (x^) t . 



Exercise 6.15 Show that the decomposition (6.9) is correct by representing 
the generic parameter 9 as (k,0k) and by introducing the submodel marginals, 




In a variable dimension model, the sampling distribution is 
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f(xi, ...,x n \0) = f(x!,. . .,x n \{k,0 k )) = fk(xi, . . .,x n \0 k ). 
Decomposing the prior distribution as 

n(9) = n((k,0 k )) = P(Wl k )w k (0 k ), 
the joint distribution of (x\, . . . , x n ) and of 9 is 

f( Xl ,. . . , x n \9)ir(9) = P(m k )ir k (e k )f k ( Xl , . . .,x n \0 k ) 
and the marginal distribution of (x\, . . . , x n ) is therefore derived by 

m(xi, ...,x n ) = y~] / P{9Jlk)n k {0 k )f k {xi, . . .,x n \9 k )d9 k 
k •* 

= X] P(W k )m k ( Xl , ...,x n ). 

k 

Similarly, the predictive distribution f(x\x\, . . . , x n ) can be expressed as 
f(x\x 1 , ...,x n )=f /(a;|0)7r(0|:ri, ...,x n )d0 



p(m k )7T k (e k ) 

m(x 1 , ...,x n ) 



d0 k 



E 

k 



P(Wl k )m k (xi, ...,x n ) 



7Tfc(6>fc) 



, fk{x\0k) rd(9 fc 

m{x 1 ,...,x n ) J m k (x 1 ,...,x n ) 



XP^fcl^i,---^™) / fk(x\9 k )ir k (9 k \ Xll . . . ,x n )d0 k . 



Therefore, 



E[x|xi,. . . ,x n ] = ^2P(M k \x 1 , ...,x n ) j j f k {x\9 k )TT k {9 k \ Xl , . . .,x n )d9 k . 



Exercise 6.16 For a finite collection of submodels Wl k (k = l,...,K), with 
respective priors n k (9 k ) and weights g k , write a generic importance sampling 
algorithm that approximates the posterior distribution. 



The formal definition of an importance sampler in this setting is straight- 
forward: all that is needed is a probability distribution (u>i, . . . , ujk) and a 
collection of importance distributions r\ k with supports at least as large as 
supp(7r fc ). The corresponding importance algorithm is then made of the three 
following steps: 
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Algorithm Importance Sampled Model Choice 

1. Generate k <~ (ui, . . . ,0Jk)\ 

2. Generate 9 k ~ r]k(0 k ); 

3. Compute the importance weight Qk^k{0k) / ^kTjk{0k) ■ 



Obviously, the difficulty in practice is to come up with weights Uk that are 
not too different from the Qk's [for efficiency reasons] while selecting perti- 
nent importance distributions rjk- This is most often impossible, hence the 
call to reversible jump techniques that are more local and thus require less 
information about the target. 



Exercise 6.17 Show that, if we define the acceptance probability 

tt 2 (x') q(x\x') 

6 = — TT i n \ A 1 
TTi(x) q(x'\x) 

for moving from x to x' and 

/ q(x'\x) 

ir 2 (x') q(x\x') 

for the reverse move, the detailed balance condition is modified in such a way that, 
if X t ~ tti(x) and if a proposal is made based on q(x\x t ), X t+ i is distributed 
from 7T2(x). Relate this property to Algorithm 6.5 and its acceptance probability. 



If K denotes the associated Markov kernel, we have that 

iti(x)K(x,x') = ni(x) ^q(x'\x)g(x,x') + 5 x (x') j q(z\x)[l - g(x,z)]dz^ 
= min {iri(x)q(x'\x), ni(x')q(x\x / )} 
+ 4(x')/ m ax { 0,^|,K(,)^(,|^ 2 (,)}d, 

= TT 2 (x)K(x / , X) 

under the assumption that the reverse acceptance probability for the reverse 
move is as proposed [a rather delicate assumption that makes the whole ex- 
ercise definitely less than rigorous]. 

In Algorithm 6.5, the derivation is perfectly sound because the kernels are 
used twice, once forward and once backward. 
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Exercise 6.18 Show that the marginal distribution of p\ when (pi, ■ ■ ■ ,Pk) ~ 

@k(o>i, ■ ■ ■ , o-k) is a 3§e(ai 7 a 2 + . . . + eifc) distribution. 



This result relics on the same representation as Exercise 4.19 if (pi, . . . ,pk) 
$>k{a-i, ■ ■ ■ jflfe)) Pi is distributed identically to 



~ ^a{a 3 ) . 



Since £ 2 + ... + & ~^ 
. . . + cifc) distribution. 



a(a2 + • • • + afe), this truly corresponds to a 38e(a\, a 2 - 



7 



Dynamic Models 



Exercise 7.1 Consider the process (x t )t£i defined by 

x t = a + bt + y t , 

where (yt)tez is an iid sequence of random variables with mean and variance 
a 2 , and where a and b are constants. Define 



w t = (2(7 + 1) 



g 

J=-« 



Compute the mean and the autocovariance function of (w t ) te2 . Show that 
(wt)tgz is not stationary but that its autocovariance function j w (t + h,t) does 
not depend on t. 



We have 



E[w t ] = E 



(2Q + 1)- 1 £ 



X t +j 



= (2q + I)" 1 ]T E [a + b(t + j) + y t ] 

j=-q 

= a + bt. 

The process (tut)tez is therefore not stationary. Moreover 
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E[w t w t+h ] = E 



a+bt+ - ^ - yt+J ] \a + bt + bh+ ^ 3/t+fc+j 



j=-q 



j=-q 



Then, 
and, 



= (a + bt)(a + bt + bh) +E _ _^ 

j=-q j=-q 
= (a + bt)(a + bt + bh) + l\ h \< q {q + 1 - \h\)a 2 . 

cov(w t ,w t+h ) = l\h\< q (q + 1 - \h\)<r 2 
lw {t + h,t) = l ]h] < q {q+l-\h\)a 2 . 



Exercise 7.2 Suppose that the process (xtjten is such that xq ~ ,/K(0,t 2 ) 
and, for all t e N, 

x t+ i\x :t ~ ^(a; t /2,a 2 ) , a > . 
Give a necessary condition on t 2 for (x t ) teN to be a (strictly) stationary process. 



We have 



Moreover, 



E[xi] = E[E[xi|x ]] = E[x /2] = . 



V(a;i) = ¥(E[xi \x }) + E[V(xi \x )} = r 2 /4 + a 2 . 

Marginaly, x\ is then distributed as a =yK(0, r 2 /4+er 2 ) variable, with the same 
distribution as x only if t 2 /4 + cr 2 = t 2 , i.e. if t 2 = 4er 2 /3. 



Exercise 7.3 Suppose that (x t )tefi is a Gaussian random walk on E: xo ~ 
■yV(0,T 2 ) and, for all t e N, 

x t +i|x :t ~ ^(x^ct 2 ) , a > . 

Show that, whatever the value of r 2 is, (x t ) t£ N is not a (strictly) stationary 
process. 



We have 
Moreover, 



E[xi] =E[E[a; 1 |a;o]] = E[x ] = 0. 
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V(a?i) = V(E[xi|x ]) + E[V(xi|x )] = ^ 2 + <r 2 . 
The marginal distribution of xi is then a yy(0,T 2 + a 2 ) distribution which 
cannot be equal to a ^V(0, r 2 ) distribution. 



Exercise 7.4 Consider the process (xt)t^ such that xo = and, for all t S N, 

a;t + i|x 0:t ~ ^(gx^a 2 ) . 

Suppose that 7r(g, <r) = l/er and that there is no constraint on g. Show that the 
conditional posterior distribution of g, conditional on the observations x 0: t and 
on a 2 is a ^ (/j,t , uij*) distribution, with 



2_. x t-\xt j 2^, x t-i anc ' u t = °" 2 / x t-i 



Show that the marginal posterior distribution of g is a Student 3~(T — 1, /z^, Vj) 
distribution, with 



T 



\t=i 1 t=o 



2 2 

X t flrp 



Apply this modeling to the AEGON series in Eurostoxx 50 and evaluate its pre- 
dictive abilities. 



Warning! The text above replaces the text of Exercise |7.4| in the first printing 
of the book, with T — 1 degrees of freedom instead of T and a new expression 
for v\. 

The posterior conditional density of g is proportional to 

T 

]Jcxp{-(a; t - (?x t _i) 2 /2cr 2 } 

T-l T-l 

-g 2 ^2 x t + 2 Q x t x t+i 



cx exp ■ 



t=o 



t=o 



which indeed leads to a jY{jiTt Wj) conditional distribution as indicated 
above. 

Given that the joint posterior density of {g, a) is proportional to 



o- T - x n ex P H^t - ?^i) 2 /2a 2 } 



integrating out cr leads to a density proportional to 
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J (a 2 )- T/2 - 1/2 cxp (j2(x t - px t ^) 2 /(2a 2 )^ Aa 
= J (0" r/2_1 exp ^(x t -p^ 1 ) 2 /(2^ 2 )^ da 



' T ~| 

.t=l J 



-T/2 



when taking into account the Jacobian. We thus get a Student 3~(T — 
1,Ht,Vt) distribution and the parameters can be derived from expanding 
the sum of squares: 



T-l T 
t=l t=0 t=l 



^Ot - QXt-i) 2 = ^x 2 t (g 2 -2qh t ) +^x 2 t 



into 



T-i T T-l 

t=0 t=l t=0 



The main point with this example is that, when p is unconstrained, the 
joint posterior distribution of (g,<r) is completely closed-form. Therefore, the 
predictive distribution of x T+ i is given by 

/ J— CXp{-(x T+ l - QX T ) 2 /2(T 2 } 7r(CT, £»|x 0: T)dCTdp 

J V27rcr 

which has again a closed-form expression: 
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/ 7= cxp{ - (x T +i - Qx t ) 2 /2<t' 2 }'k(<t, g\x :T)dcrdQ 
J Vzna 

oc / o-~ T ~ 2 cxp{-y^(a; t+ i - gx t ) 2 /2a 2 }dadg 

, T \ -(T+l)/2 



oc 



oc 



\t=0 / " K 

, T ^ -(T+l)/2 

cx I > x: i i/^ T_1 



E< 

E x ? E - ^ E *txt+\ \ 

t=o t=o { t=a ) 



This is a Student T(T, <5t,o>t) distribution, with 

T-l T-l 



5 T = x T ^ xtxt+i/ E x\ = p T x T 



t=o t=o 

and 



T-l 



"t= E^E*?- E^+0 r / T E 



2 



t=o t=o Vt=o / I ' t=o 



The predictive abilities of the model are thus in providing a point estimate for 
the next observation xt+i = Ptxt, and a confidence band around this value. 



Exercise 7.5 Give the necessary and sufficient condition under which an AR(2) 
process with autoregressive polynomial V(u) = 1 — Q\u — Q2U 2 (with g 2 ^ 0) is 
causal. 



The AR(2) process with autoregressive polynomial V is causal if and only 
if the roots of V are outside the unit circle in the complex plane. The roots 
of V are given by 

_ _ -Qi ~ v / (?i + 4 £'2 , + _ -Qi + Vel + 4 £>2 
-2g 2 -2^2 



with the convention that s/x = t\/—x € C if x < 0. (Because of the symmetry 
of the roots wrt pi, the causality region will be symmetric in p\.) 
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A first empirical approach based on simulation is to produce a sample 
°f (f?i? f?2)'s over the sphere of radius 6 (6 is chosen arbitrarily and could be 
changed if this is too small a bound) and to plot only those (pi , ^2) 's for which 
the roots u~ and u + are outside the unit circle. 

# Number of points 
N=10~4 



# Sample of rho's 

rhol=rnorm(N) 

rho2=rnorm(N) 

rad=6*runif (N)/sqrt(rhol~2+rho2~2) 

rhol=rad*rliol 

rho2=rad*rho2 

R=matrix(l ,ncol=3 ,nrow=N) 

R[,2]=-rhol 

R[,3]=-rho2 



roots=apply (R, 1 ,polyroot) 

indx=(l:N) [(Mod(roots [1 , ] ) >1)] 

indx=indx [ (Mod (root s [2 , indx] ) > 1 ) ] 

plot (rhol [indx] ,rho2[indx] ,col="grey" ,cex=.4, 

xlab=expression(rho [1] ) ,ylab=expression(rho [2] ) ) 



The output of this program is given on Figure 7.1 but, while it looks like a 
triangular shape, this does not define an analytical version of the restricted 
parameter space. 

If we now look at the analytical solution, there are two cases to consider: 



either g\ + 4g 2 < and the roots are then complex numbers, or q\ 
and the roots are then real numbers. 
If q\ + 4g 2 < 0, then 

2 

-Qi ±i^-{g\ + Aq 2 ) 



4p 2 > 



2g2 



> 1 



implies that —1 < Q2, which, together with the constraint g\ 
provides a first region of values for (gi, £2): 

Cl = {(Qi, Q2)\ \Ql\ < 4, Q2 < -0l/4} • 

If q\ + 4q2 > 0, then the condition | m ± | > 1 turns into 



4^2 < 0, 



\J q\ +4(? 2 -|f?i| > 2p 2 if £>2>0 and| £ »i|-y / (? 2 + 4e2 > -2q 2 if Q 2 < 

Thus, this amounts to compare \f q\ + 4g2 with 2|g 2 + or, equivalcntly, 
4^2 with Aq\ + 4\q\Q2\, ending up with the global condition g 2 < 1 — | £?i - The 
second part of the causality set is thus 
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o 




Fig. 7.1. Acceptable values of (pi,p2) for the AR(2) model obtained by simulation. 

c 2 = {(ei, 02); < 4, g 2 < -e?/4} , 

and cumulating both regions leads to the triangle 

? = {(Ql, 02); 02 > -1, £>2 < 1 - |ei|} • 



Exercise 7.6 Show that the stationary distribution of x_ p ._i is a =y^(/il p ,A) 
distribution, and give a fixed point equation satisfied by the covariance matrix A. 



If we denote 
then 

Therefore, 
and 



z t = (x t ,x t -i, . . .,x t+ i-p) , 
z t+ i = ,ulp + B (z t - ,ul p ) + e t+1 . 
E [z t+ i|z t ] = /ilp + B (z t - /il p ) 



V(z t+1 |z t )=V(e t+ i) 



a 2 0...0 
... 

... 



V. 
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Then, 

z t+1 |z t - 7V p (,ulp + B (z t - (Ulp) , V) . 

Therefore, if z_i = x_ p: _i <~ A/p(/zl p ,yl) is Gaussian, then z t is Gaussian. 
Suppose that z t ~ N P {M 1 A), we get 

E [z t+ i) = filp + B (M — [ilp] 

and E [z t+1 ] = E [z t ] if 

Mlp + S(M-^l p ) = M , 
which means that M — [il p . Similarly, V (z t+ i) = V (z t ) if and only if 

BAB' + V = A , 
which is the "fixed point" equation satisfied by A. 



Exercise 7.7 Show that the posterior distribution on 9 associated with the prior 
7r(0) = l/cr 2 is well-defined for T > p observations. 




Warning: The prior tt(9) = l/a was wrongly used in the first printing of the 
book. 

The likelihood conditional on the initial values Xo-.( p -i) is proportional to 



-T+p- 



t= P 



exp 



- M - Bi(x t -i - M) /2c 2 



A traditional noninformative prior is ir((i, Q\, . . . , g pi a 2 ) = l/a 2 . In that case, 
the probability density of the posterior distribution is proportional to 



-T+p-3 



n exp 



x t 



M - 3t{xt-i - A*) /2er 2 



And 



( (T 2 ) -(T- P+ 3)/2-Q exp 



t=p 



X t - Qi{x t -i - /Lt) /2cr 2 f dcr 2 < oo 

V »=l / 



holds for T — p + 1 > 0, i.e., T > p — 1. This integral is equal to 



2 >> (p-T-l)/2 
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which is intcgrablc in \i for T — p > 0, i.e. T > p. The other parameters Qj 
(j = 1, . . . ,p0 being bounded, the remaining integrand is clearly integrable in 



Exercise 7.8 Show that the coefficients of the polynomial V can be de- 
rived in 0(p 2 ) time from the inverse roots Aj using the recurrence relations 

(i = 1,.. . ,p,j = 0,...,p) 

r = i , ¥i = v^ 1 - A^pi , 

where ip® = 1 and V] = for j > i, and setting Qj = —i\) v - {j = 1, . . . ,p). 



Warning: The useless sentence "Deduce that the likelihood is computable in 
0(Tp 2 ) time" found in the first printing of the book has been removed. 
Since 

v j 

i=l j=l 

we can expand the lhs one root at a time. If we set 



3=1 3=0 

' then 

i+l i 

H(i-\ jX )=(i-\ i+1 x)i[(i-\ j x) 

3=1 3=1 

i 

= (1 - A, 

= 1 + - A »+i^-i)^ - A 4 +i^^ +1 , 

i=i 

which establishes the — ipj — Ai+i?/;^ recurrence relation. 

This recursive process requires the allocation of i variables at the ith stage; 
the coefficients of V can thus be derived with a complexity of 0(p 2 ). 



Exercise 7.9 Show that, if the proposal on a 1 is a log-normal distribution 
£j\f(log(a 2 _ 1 ), t 2 ) and if the prior distribution on a 2 is the noninformative prior 
7r(cr 2 ) = 1/(T 2 , the acceptance ratio also reduces to the likelihood ratio because 
of the Jacobian. 
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Warning: In the first printing of the book, there is a log missing in the mean 
of the log-normal distribution. 

If we write the Metropolis-Hastings ratio for a current value a 2 an d a 
proposed value a 2 , we get 

A°XY{°\) exp (-(logK - log(a?)) 2 /2r 2 ) /<xg = 
7r(a 2 )% 2 ) exp (-(log(a 2 - log(a 2 )) 72r 2 ) /a 2 % 2 ) ' 

as indicated. 

Exercise 7.10 Write an R program that extends the reversible jump algorithm 
7.1 to the case when the order p is unknown and apply it to the same Ahold Kon. 
series of Eurostoxx 50. 



The modification is rather straightforward if one only considers birth and 
death moves, adding and removing real or complex roots in the polynomial. 
When the new values are generated from the prior, as in the program provided 
by #7.txt on the Webpage, the acceptance probability remains equal to the 
likelihood ratio (with the usual modifications at the boundaries). 

Exercise 7.11 For an MA(q) process, show that (s < q) 

1-\s\ 

7x(s) = cr 2 ^i+M • 

i=0 



We have 

7 x (s) = E[x t x t -a] 

= E [[e t + lCt _i + . . . + & q e t - q ] [e t -s + 0ie t _ 8 _i + . . . + & q e t -a- q ]] ■ 

Then, if 1 < s < q, 

lx{s) = [0, + $ s +l$l + . . . + <J 2 

and 

7 ,(0)= [l + d\ + ... + d 2 ] a 2 . 
Therefore, if (0 < s < q) with the convention that i?o = 1 

q-s 
i=0 

The fact that 7 x (s) = ^ x (— s) concludes the proof. 
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Exercise 7.12 Show that the conditional distribution of (e , . . . , given 
both xi : t and the parameters is a normal distribution. Evaluate the complexity 
of computing the mean and covariance matrix of this distribution. 



The distribution of xi : t conditional on (eo, . . . , is proportional to 



a T Y[ cxp < 



t=i 



X t — fi + ^ 1?j?t- 



2a 2 



Take 



(e , . . . , e_ 9+ i) - N q (0 q ,a 2 I q ) . 

In that case, the conditional distribution of (e , . . . , e_ g+ i) given xi-t is pro- 
portional to 

f[ cxp{- e ?/2a 2 }ncxp{-^Ax 2 }. 

i=-q+l t=l 

Due to the recursive definition of ej, the computation of the mean and the 
covariance matrix of this distribution is too costly to be available for realistic 
values of T. For instance, getting the conditional mean of ej requires deriving 
the coefficients of €i from all terms 

\ 2 
<i \ 

by exploiting the recursive relation 

e t = z t - fj, + ^ i?jei-j • 

If we write ei = Ji + and e t — 8 t + Pt^i, then we need to use the recursive 
formula 

g i 
S t = x t - (j, + ^2$jSt-j , (3 t = ^2f3t-j, 

before constructing the conditional mean of e^. The corresponding cost for this 
single step is therefore 0(Tq) and therefore 0(qT 2 ) for the whole series of e/s. 
Similar arguments can be used for computing the conditional variances. 



Exercise 7.13 Give the conditional distribution of e_t given the other e_i's, 
xi : t, and the ei's. Show that it only depends on the other e_j's, xi :g _ t+ i, and 

ei:g-t+l- 
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The formulation of the exercise is slightly too vague in that the e",'s are 
deterministic quantities based on the e_i's and x 1: ^. Thus, from a probabilistic 
point of view, the conditional distribution of e_ t only depends on the other 
e^i's and xi-t- However, from an algorithmic point of view, if we take the ei's 
as additional observables in (7.11), spotting in 



in the sum since, for t — q > —£, i.e. for t > q — £, e_£ does not appear in 
the distribution. (Again, this is a formal construct that does not account for 
the deterministic derivation of the ei's.) The conditional distribution of e_£ is 
then obviously a normal distribution. 

Exercise 7.14 Show that the predictive horizon for the MA(q) model is re- 
stricted to the first q future observations Xt+i- 



Obviously, due to the lack of correlation between x T+q+ j (j > 0) and x 1:T 
we have 

E [.t t+9+ i|xi :T ] = E [x T+q +i] = 
and therefore the MA(q) model has no predictive ability further than horizon 

q- 

Exercise 7.15 Show that, when the support y is finite and when (ytjten is 
stationary, the marginal distribution of x t is the same mixture distribution for all 
t's. Deduce that the same identifiability problem as in mixture models occurs in 
this setting. 



Since the marginal distribution of x t is given by 




leads to keep only 





t-l q 





where 7r is the stationary distribution of (yt), this is indeed a mixture distri- 
bution. Although this is not the fundamental reason for the unidcntifiability 
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of hidden Markov models, there exists an issue of label switching similar to 
the case of standard mixtures. 



Exercise 7.16 Write down the joint distribution of (yt,Xt)teN in (7.19) and 
deduce that the (observed) likelihood is not available in closed form. 



Recall that y ~ W(0, a 2 ) and, for t = 1, . . . , T, 

\ yt = mt-\ + °<h-i > 

\x t = /?e^/ 2 e t , 

where both e t and t* t are iid Af(0, 1) random variables. The joint distribution 
of (xi : T,yo:T) is therefore 

/ (X1:T, Y0:t) = / (xi :T |yO:T) / (yO:T) 

= IjX^I^ f(yo)f(yi\yo)---f(yT\yT-i) 

g^rj2 cxp|-^y t /2^J cxp ^ x\ exp(-</ t )^ 



(27T/3 2 

1 



(27TCT 2 ) 



2^(T+l)/2 



exp 



Due to the double exponential term cxp ^— 7^2 St=i x t ex P(~J/t)) 1 ^ is im- 
possible to find a closed-form of the integral in yo-.r- 



Exercise 7.17 Show that the counterpart of the prediction filter in the Markov- 
switching case is given by 



logp(xi :t ) = ^log 



^/(av|av_i,2/ r = i)<p r {i) 



where <p r (i) = F(y r = i|xi :r _i) is given by the recursive formula 

K 

<p r (i) oc ^Pj,/(a; r _i|a; r _2,2/r-i = j)v> r -iU) ■ 



Warning! There is a typo in the first printing of the book where <p r is defined 
conditional on xi :t _i instead of xi :r _i. 
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This exercise is more or less obvious given the developments provided in 
the book. The distribution of y r given the past values Xi :r _i is the marginal 
of (y r ,y r _i) given the past values xi :r _i: 

K 

F(y r = i|xi :t _i) = ^P(y r = i,Vr-l = j|xi :r -l) 

j=l 

K 

= ^2^(yr-i = i|xi :r _i)P(y r = i\y r -\ =j) 

K 

oc ^PjjP(y r _i =j,a; r _i|xi :r _2) 

= ^PjjP(y r _i = j, |xi :I ._ 2 )/(x r _i|x I ._ 2 ,yr-i = j), 
j=l 

which leads to the update formula for the if r (i)'- The marginal distribution 
xi : t is then derived by 

* 

P(*l:t) = ^(^|X 1:(T ._ 1} ) 
r=l 

= IlX! P ( 2/r - 1 = J' X r\ x l:r-l) 
r=l j=l 

r=l j=l 

with the obvious convention <f\(i) = 7Tj, if (7Ti, . . . , 7r K ) is the stationary dis- 
tribution associated with P = (pij ) . 
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Image Analysis 



Exercise 8.1 Draw an example with n = 5 points in M 2 such that the fc-nearest- 
neighbor relation is not symmetric. 

The first quadrant of Figure 8.2 in the book is already an illustration of an 
assymmetric neighborhood relation. Once two points are drawn, it is sufficient 
to find sequentialy new points that are closer to the latest than the earlier 
points. Figure |87T| illustrates this case in dimension one, with decreasing-radius 
circles to emphasize the assymetry: each point on the line is such that its right 
neighbor is its nearest neighbor and that it is the nearest neighbor of its left 
neighbor. 

Exercise 8.2 For a given pair (k,n) and a uniform distribution of x in [0,1] 3 , 
design a Monte Carlo experiment that evaluates the distribution of the size of the 
symmetrized fc-nearest-neighborhood. 

For a given pair (k,n), the Monte Carlo experiment produces iV random 
samples on [0, l] 3 , for instance as 

samp=matr ix (runif (3*n) , n , 3) 

compute the fc-nearest-neighborhood matrix for this sample, by 

disamp=as.matrix(dist(samp,up=T,diag=T)) #distance matrix 

neibr=t (apply (disamp , 1 , order) )[, -1] #k nearest neighbours 

knnbr=neibr [ , 1 : k] 
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Fig. 8.1. Sequence ol five points with assymetry in the nearest-neighbor relations. 



newnbr=matrix(0,n,n) #k nearest neighbours 

for (i in l:n) #indicator matrix 

newnbr [i ,knnbr [i , ] ] =1 

and compute the sizes of the symmetrized fc-nearest-ncighborhoods for those 
samples, by 

size [t ,] =apply (newnbr+t (newnbr) >0 , 1 , sum) 

ending up with an approximation to the distribution of the size over the 
samples. It is then possible to summarize the matrix size on an histogram 
as in Figure |8.2| 



Exercise 8.3 When y = (2/1,2/2)1 show that the joint pdf of y is given by 

Discuss the extension to the general case. {Indication: The extension is solved via 
the Hammersley-Clifford theorem, given in Section 8.3.1.) 
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Fig. 8.2. Histogram of the size of the symmetrized fc-nearest-neighborhoods when 
k = 3 and n = 200 based on N = 1000 simulations. 



This cxercice is a consequence of Exercice |3.21| we have that, for all 2/2's, 

/(yi|X k) = /(^i lz/2, X, g, k)/f(y 2 \yi, X, g, fc) 

//(yi|y2,X,/3,fc)//(y 2 |yi,X,/3,fe)d3/i 
= /(j/i \m , X, g, k)/f(y 2 \yi , X, g, fc) 
Eg =1 /(C 9 l2/2,X,AA;)//(y 2 |C S! X,/3,fc) ' 

since the support of y\ is finite. We can therefore conclude that 

/(y |x, p, k) = -= /(2/l|y2 ' X ' /3 ' fc) . 

T,Uf(C g \y2,X,f3,k)/f(y 2 \C g ,X,0,k) 

As suggested, the extension is solved via the Hammersley-Clifford theo- 
rem, given in (8.4). See Section 8.3.1 for details. 



Exercise 8.4 Find two conditional distributions f(x\y) and g(y\x) such that 
there is no joint distribution corresponding to both / and g. Find a necessary 
condition for / and g to be compatible in that respect, i.e. to correspond to a 
joint distribution on (x,y). 



As stated, this is a rather obvious question: if f(x\y) — 4yexp(— iyx) and 
if g{y\x) — 6xexp(— 6xy), there cannot be a joint distribution inducing these 
two conditionals. What is more interesting is that, if f(x\y) = Ay exp(— Ayx) 
and g{y\x) = 4xexp(— 4yx), there still is no joint distribution, despite the 
formal agreement between both conditionals: the only joint that would work 
has the major drawback that it has an infinite mass! 



Exercise 8.5 Using the Hammersley-Clifford theorem, show that the full condi- 
tional distributions given by (8.1) are compatible with a joint distribution. 
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Note: In order to expose the error made in the first printing in using the size 
of the symmetrized neighborhood, N k (i), we will compute the potential joint 
distribution based on the pseudo-conditional 

= Cjly-i, X, /3, fc)ocexp(/3£ l C] ( w ) / N k (i) j , 

even though it is defined for a fixed Nk(i) = in the book. 

It follows from (8.4) that, if there exists a joint distribution, it satisfies 



P(y |X, ft fc) oc n p( ^^ ^ _ _ ^ yh ^ ... )yn) X,/3,fc) ' 
Therefore, 

P(y|X,/?,fc) cxexp J /}£ * E [l,-^) - E^y*)] - 

[ i=l k( - > \e<i,£~ k i 

E MwO-Wtf)] 

is the candidate joint distribution. Unfortunately, if we now try to derive the 
conditional distribution of i/j from this joint, we get 



P(^ = C J |y_ t ,X,/3,fc)cxexp / 3J — E l yt {Vj)+ E 

[ feW t>j,t~hi Kj,t~ 



. N k (l) 



^ E ^(w)- E I fly| 



which differs from the orginal conditional if the iV^,(j)'s differ. In conclusion, 
there is no joint distribution if (8.1) is defined as in the first printing. Taking 
all the Nk(j)'s equal leads to a coherent joint distribution since the last line 
in the above equation cancels. 



Exercise 8.6 If a joint density 7r(yi, y n ) is such that the conditionals 
7r(y_j|2/j) never cancel on the supports of the marginals m_ i (y_ i ), show that the 
support of tt is equal to the cartesian product of the supports of the marginals. 



Warning! This exercise replaces the former version of Exercise 8.6 "If 
7r(xi, . . . ,x n ) is a density such that its full conditionals never cancel on its 
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support, characterize the support of 7r in terms of the supports of the marginal 
distributions. " 

Let us suppose that the support of n is not equal to the product of the 
supports of the marginals. (This means that the support of it is smaller than 
this product.) Then the conditionals 7r(y_i|yi) cannot be positive everywhere 
on the support of m(y_j). 



Exercise 8.7 Describe the collection of cliques C for an 8 neighbor neighborhood 
structure such as in Figure 8.7 on a regular nx m array. Compute the number of 
cliques. 




Fig. 8.3. Neighborhood relations between the points of a 4 x 4 regular grid for a 
8 neighbor neighborhood structure. 
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Exercise 8.8 Use the Hammersley-Clifford theorem to establish that (8.6) is the 
joint distribution associated with the above conditionals. Deduce that the Ising 
model is a MRF. 

Following the developments in Exercise |8.5| this is exactly the same prob- 
lem as for the distribution (8.1) with a fixed neighborhood structure and the 
use of j3 instead of /3/Nk- 

Exercise 8.9 Draw the function Z(/3) for a 3 x 5 array. Determine the com- 
putational cost of the derivation of the normalizing constant Z(f3) of (8.6) for a 
m x n array. 

The function Z(J3) is defined by 

which involves a summation over the set X of size 2 15 . The R code corre- 
sponding to this summation is 

neigh=f unction(i , j){ #Neighbourhood indicator function 
(i==j+l) I I (i==j-l) I I (i==j+5) I I (i==j-5) 

} 

zee=f unction(beta){ 
val=0 

array=rep (0,15) 

for (i in 1: (2~15-1)){ 

expterm=0 

for (j in 1:15) 

expterm=expterm+sum( (array==array [j] ) *neigh(i=l : 15 , j=j ) ) 

val=val+exp (beta*expterm) 
while (array [j]==l){ 
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array [j] =0 
j=j+l 

} 

array [j] =1 

} 

expterm=0 

for (j in 1:15) 

expterm=expterm+sum( (array==array [j] ) *neigh(i=l : 15 , j=j ) ) 

val=val+exp(beta*expterm) 

1/val 



} 



It produces the (exact) curve given in Figure 8.4 




Fig. 8.4. Plot of the function Z{0) for a 3 x 5 array with a four neighbor structure. 



In the case of a m x n array, the summation involves 2 mx ™ and each 
exponential term in the summation requires (m x n) 2 evaluations, which leads 
to a 0((m x n) 2 2 rnxn ) overall cost. 
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Exercise 8.10 For an n x to array X, if the neighborhood relation is based 
on the four nearest neighbors as in Figure 8.7, show that the Xij's for which 
(i + j) = 0(2) are independent conditional on the Xij's for which (i + j) = 1(2) 
(1 < i < n, 1 < j < to). Deduce that the update of the whole image can be 
done in two steps by simulating the pixels with even sums of indices and then the 
pixels with odd sums of indices. (This modification of Algorithm 8.2 is a version 
of the Swendsen-Wang algorithm.) 

Warning! This exercise replaces the former version of Exercise |8~T0| "For an 
nx to array X, if the neighborhood relation is based on the four nearest neigh- 
bors, show that the Xnij ' s are independent conditional on the £2-t-i,2j-i ' s 
(1 < i < n, 1 < j ; < to). Deduce that the update of the whole image can be 
done in two steps by simulating the pixels with even indices and then the pixels 
with odd indices" 

This exercise is simply illustrating in the simplest case the improvement 
brought by the Swendsen-Wang algorithm upon the Gibbs sampler for image 
processing. 

As should be obvious from Figure 8.7 in the book, the dependence graph 
between the nodes of the array is such that a given Xij is independent from 
all the other nodes, conditional on its four neighbours. When (i + j) = 0(2), 
the neighbours have indices such that (i + j) = 1(2), which establishes 
the first result. 

Therefore, a radical alternative to the node-by-node update is to run a 
Gibbs sampler with two steps: a first step that updates the nodes Xij with 
even (i+j)'s and a step that updates the nodes Xij with odd (i + j)'s. This is 
quite a powerful solution in that it achieves the properties of two-stage Gibbs 
sampling, as for instance the Markovianity of the subchains generated at each 
step (see Robert and Casella, 2004, Chapter 9, for details). 

Exercise 8.11 Determine the computational cost of the derivation of the nor- 
malizing constant of the distribution (8.7) for a to x n array and G different 
colors. 

Just as in Exercise |8.9[ finding the exact normalizing requires summing 
over all possible values of x, which involves G mxn terms. And each exponential 
term involves a sum over (to x n) 2 terms, even though clever programing of 
the neighborhood system may reduce the computational cost down to to x n. 
Overall, the normalizing constant faces a computing cost of at least 0(to x 
n x G mxn ). 
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Exercise 8.12 Use the Hammersley-Clifford theorem to establish that (8.7) is 
the joint distribution associated with the above conditionals. Deduce that the 
Potts model is a MRF. 



Similar to the resolution of Exercise |8.5[ using the Hammersley-Clifford 
representation (8.4) and defining an arbitrary order on the set X leads to the 
joint distribution 

exp |/3£ ie x V,. , I Xi = Xj + E J>l ,^ l K=x* } 

7r(x) (X 



exp 



\3~i,3<i 3~h3>* J~i,J>» 



exp < 13 \ ^2 l xi= Xj + 51 l **=">* t ~ %=*i 

[ \3~i,3<i 

6X P { @ 



3~i 



So we indeed recover a joint distribution that is compatible with the initial full 
conditionals of the Potts model. The fact that the Potts is a MRF is obvious 
when considering its conditional distributions. 



Exercise 8.13 Derive an alternative to Algorithm 8.3 where the probabilities in 
the multinomial proposal are proportional to the numbers of neighbors n UlS and 
compare its performance with those of Algorithm 8.3. 



In Step 2 of Algorithm 8.3, another possibility is to select the proposed 
value of x Ui from a multinomial distribution 

M G (l;n { *\ue),...,n%\uij S ) 

where n'p (ue) denotes the number of neighbors of u\ that take the value g. 
This is likely to be more efficient than a purely random proposal, especially 
when the value of (3 is high. 



Exer cise 8.14 Show that the Swendsen-Wang improvement given in Exercise 
8.10 also applies to the simulation of 7r(x|y, (3, a 2 , fi). 
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This is kind of obvious when considering that taking into account the 
values of the j/j's does not modify the dependence structure of the Potts model. 
Therefore, if there is a decomposition of the grid 2" into a small number of 
sub-grids li, ... ,1k such that all the points in lj are independent from one 
another given the other T/s, a k step Gibbs sampler can be proposed for the 
simulation of x. 



Exercise 8.15 Using a piecewise-linear interpolation of /(/?) based on the val- 
ues /(/3 1 ), . . . , f(/3 M ), with < ft < . . . < 0m = 2, give the explicit value of 
the integral 



for any pair < a < a\ < 2. 



This follows directly from the R program provided in #8.txt, with 



/' 



/(/?)d/J« J2 /(ftXft+i-ft), 

i,ao</3i<ctl 



with the appropriate corrections at the boundaries. 



Exercise 8.16 Show that the estimators x that minimize the posterior expected 
losses E[Li(x,x)|y)] and E[L 2 (x,x)|y)] are x MPM and x MAP , respectively. 



Since 

Li(x,x) = , 

iei 

the estimator x associated with L\ is minimising 



E 



and therefore, for every i e I, ^ minimizes P(xj ^ Xi), which indeed gives 
the MPM as the solution. Similarly, 

L 2 (x,x) = I x ^s 

leads to x as the solution to 

minE [l x #x|y] = minP (x ^ xly) , 

X X 

which means that x is the posterior mode. 
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Exercise 8.17 Determine the estimators x associated 


with two loss functions 


that penalize differently the classification errors, 




L 3 (x,x) = lx i =x j ^xiTtx j and L 4 (x,x) = 









Even though L3 and L4 are very similar, they enjoy completely different 
properties. In fact, L 3 is basically useless because x = (1, • • • ,1) is always an 
optimal solution! 

If we now look at L4, we first notice that this loss function is invariant by 
permutation of the classes in x: all that matters are the groups of components 
of x taking the same value. Minimizing this loss function then amounts to 
finding a clustering algorithm. To achieve this goal, we first look at the dif- 
ference in the risks when allocating an arbitrary Xi to the value o and when 
allocating to the value b. This difference is equal to 

J2 p(a* = *,•)- E n^ = Xj ). 

3,Xj=a j,Xj=b 

It is therefore obvious that, for a given configuration of the other x/s, we 
should pick the value a that minimizes the sum J2j - ._ o P(xj = Xj). Once Xi 
is allocated to this value, a new index I is to be chosen for possible realloca- 
tion until the scheme has reached a fixed configuration, that is, no Xi need 
reallocation. 

This scheme produces a smaller risk at each of its steps so it does neces- 
sarily converge to a fixed point. What is less clear is that this produces the 
global minimum of the risk. An experimental way of checking this is to run 
the scheme with different starting points and to compare the final values of 
the risk. 



Exercise 8.18 Since the maximum of 7r(x|y) is the same as that of 7r(x|y) K 
for every k € N, show that 

^(x|y) K = Jn{x,e 1 \y)d9 1 x---x J n(x,0 K \y)de K (8.10) 

where 0j = /Xj, of) (1 < i < n). Deduce from this representation an opti- 
mization scheme that slowly increases k over iterations and that runs a Gibbs 
sampler for the integrand of (8.10) at each iteration. 



The representation (8.10) is obvious since 
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= J^0 1 \y)d6 1 x ••• x J ir(x,6 K \y)d6 K 

given that the symbols Oi within the integrals are dummies. 

This is however the basis for the so-called SAME algorithm of Doucet, 
Godsill and Robert (2001), described in detail in Robert and Casella (2004). 
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